nips nips2009 nips2009-90 knowledge-graph by maker-knowledge-mining

90 nips-2009-Factor Modeling for Advertisement Targeting


Source: pdf

Author: Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

Abstract: We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. SpeciÄ?Ĺš cally, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo’s vast datasets, and show that our approach substantially outperform the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP is exceeding 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce, the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. [sent-8, score-0.553]

2 We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. [sent-9, score-0.427]

3 The objective of ad targeting is to select most relevant ads to present to a user based on contextual and prior knowledge about this user. [sent-29, score-0.781]

4 For instance, sponsored search (SS) [17] uses query, content match [5] relies on page content, and behavioral targeting (BT) [11] leverages historical user behavior. [sent-31, score-0.677]

5 Nevertheless, the training data can be generally formed as a user-feature matrix of event counts, where the feature dimension contains various events such as queries, ad clicks and views. [sent-32, score-0.484]

6 In general, the goal of latent variable models is to discover statistical structures (factors) latent in the data, often with dimensionality reduction, and thus to generalize well to unseen examples. [sent-34, score-0.186]

7 1 Sponsored search involves placing textual ads related to the user query alongside the algorithmic search results. [sent-39, score-0.503]

8 To estimate ad relevance, previous approaches include similarity search [5], logistic regression [25, 8], classiÄ? [sent-40, score-0.29]

9 We consider the problem of estimating CTR of the form p(click|ad, user, query), through a factorization of the user-feature matrix into a latent factor space, as derived in Section 2. [sent-42, score-0.174]

10 Behavioral targeting leverages historical user behavior to select relevant ads to display. [sent-46, score-0.562]

11 The question addressed by the state-of-the-art BT is instead that of predicting the CTR of an ad in a given category (e. [sent-49, score-0.282]

12 Let F be an n Ä‚— m data matrix whose element fij is the observed count of event (or feature) i by user j. [sent-58, score-0.53]

13 Let X be a d Ä‚— m matrix where the column vector xj is a low-dimensional representation of user j in a latent space of “topicsâ€? [sent-63, score-0.444]

14 of user j to topic k as the total number of occurrences of all events contributing to topic k. [sent-67, score-0.366]

15 ĂŽ› is an nÄ‚—d matrix where the column ĂŽ›k represents the kth topic as a vector of event probabilities p(i|k), that is, a multinomial distribution of event counts conditioned on topic k. [sent-68, score-0.228]

16 The approximation has an appealing interpretation column-wise f ≈ ĂŽ›x, that is, each user vector f in event space is approximated by a linear combination of the column vectors of ĂŽ›, weighted by the topical mixture x for that user. [sent-73, score-0.352]

17 Finally, xkj is given a gamma distribution as an empirical prior. [sent-76, score-0.239]

18 The generative process of an observed event-user count fij follows: 1. [sent-77, score-0.185]

19 , yij = ĂŽ›i xj where ĂŽ›i is the ith row vector of ĂŽ›. [sent-82, score-0.123]

20 Next, from the latent random vector characterizing a user x, we derive the expected count vector y for the user as follows: y = ĂŽ›x. [sent-87, score-0.704]

21 The data likelihood for a user generated as described above is n f yi i exp(−yi ) fi ! [sent-90, score-0.279]

22 xkj ~ Gamma(ĂŽÄ…k,ĂŽË›k) query-ad: ‘machine+learning+8532948011’ Figure 1: GaP graphical model where yi = i x. [sent-93, score-0.205]

23 )+ i [( 1) log xk k xk k k log( k) log ( k )] (5) k Given a corpus of user data F = (f1 fj fm ), we wish to Ä? [sent-95, score-0.351]

24 Based on an elegant multiplicative recurrence developed by Lee and Seung [22] for NMF, the following EM algorithm was derived in [6]: (fij ik yij ) + ( k 1) xkj xkj i (6) E-step: xkj k i ik + 1 M-step: j ik fij xkj y ij ik j 2. [sent-97, score-1.322]

25 1 xkj (7) Two variants for CTR prediction The standard GaP model Ä? [sent-98, score-0.257]

26 The second idea is to consider the relative frequency of counts, particularly the number of clicks relative to the number of views for the events of interest. [sent-103, score-0.224]

27 Formally, let F be a matrix of observed click counts and Y be a matrix of the corresponding expected click counts. [sent-104, score-0.822]

28 We further introduce a matrix of observed views V and a matrix of click probabilities Z, and deÄ? [sent-105, score-0.491]

29 The linear predictor Z = X now estimates CTR directly, and is scaled by the observed view counts V to obtain the expected number of clicks Y . [sent-108, score-0.269]

30 The Poisson assumption is only given to the click events F with the mean parameters Y . [sent-109, score-0.384]

31 Given a number of views v and the probability of click for a single view or CTR, a more natural stochastic model for click counts is Binomial(v CTR). [sent-110, score-0.922]

32 (9), we derive the following EM recurrence: (fij ik zij ) + ( k 1) xkj E-step: xkj (10) xkj i k i (vij ik ) + 1 j (fij xkj z ij ) M-step: (11) ik ik j (vij xkj ) 3 2. [sent-115, score-1.256]

33 In LDA, the choice of latent factor is made independently word-by-word, or in the BT case, ad view by ad view. [sent-133, score-0.699]

34 The scale factors provide an estimated count of the number of items drawn from each latent factor. [sent-138, score-0.17]

35 Ĺš cally the dimensionality and the user data locality, the deployment of GaP for SS involves three processes: (1) ofÄ? [sent-149, score-0.315]

36 First, given the observed click counts F and view counts V obtained from a corpus of historical user data, we derive ĂŽ› and X using the CTR-based GaP algorithm in Eqs. [sent-156, score-0.936]

37 Other features can also be added based on their predicting capabilities, such as query term, linead term, ad group, and match type. [sent-162, score-0.387]

38 In order for the model matrix ĂŽ› to capture the corpus-wide topical structure, the entire user corpus should be used as training set. [sent-165, score-0.364]

39 Second, given the derived model matrix ĂŽ›, we update the user proÄ? [sent-169, score-0.305]

40 To ensure the model to capture the latest user behavioral pattern and to have high coverage of users, one needs to refresh the model often, e. [sent-173, score-0.358]

41 Once a model is trained on a full corpus of user data, it sufÄ? [sent-181, score-0.305]

42 ĂŽ› contains the global information of latent topics in the form 4 of feature mixtures. [sent-183, score-0.129]

43 Note that this bucketization is exactly how production ad serving works. [sent-185, score-0.258]

44 (10), the update rule for a given user xj only involves the data for that user and the global ĂŽ›. [sent-188, score-0.604]

45 Notice that the multiplicative factor in E-step depends on xkj , the parameter being updated, thus consecutive E-steps will indeed advance convergence. [sent-191, score-0.258]

46 Finally, given the global ĂŽ› and a local X learned and stored in each server, the expected CTR for a user given a QL pair or p(click|QL, user) is computed online as follows. [sent-193, score-0.311]

47 Suppose a user issues a query, a candidate set of lineads is retrieved by applying various matching algorithms. [sent-194, score-0.307]

48 One then extracts the row vectors from ĂŽ› corresponding to the candidate QL set to form a smaller block ĂŽ›mat , and looks up the column vector xj for that user from X. [sent-196, score-0.325]

49 2 Positional normalization Our analysis so far has been abstracted from another essential factor, that is, the position of an ad impression on a search result page. [sent-200, score-0.402]

50 It is known intuitively and empirically that ad position has a signiÄ? [sent-201, score-0.322]

51 In this section we treat the positional effect in a statistically sound manner. [sent-203, score-0.169]

52 to a same presentation position, in order to capture the probability of click regardless of where the impression is shown. [sent-208, score-0.399]

53 To achieve positional normalization, we assume the following Markov chain: (1) viewing an ad given its position, and then (2) clicking the ad given a user actually views the ad; thus p(click|position) = p(click|view)p(view|position), (12) where “view� [sent-209, score-1.1]

54 is the event of a user voluntarily examining an ad, instead of an ad impression itself. [sent-210, score-0.625]

55 As it turns out, to estimate the positional prior p(view|position) we can apply a special GaP factorization with one inner dimension. [sent-213, score-0.231]

56 First, the GaP algorithm for estimating positional priors is run on the observed click and view counts of (feature, position) pairs. [sent-218, score-0.652]

57 This yields a row vector of positional priors xpos . [sent-219, score-0.197]

58 In model training, each ad view occurrence is then normalized (multiplied) by the prior p(view|position) for the position where the ad is presented. [sent-220, score-0.644]

59 , ov-top+1 in Yahoo’s terminology meaning the North 1 position in sponsored results) is typically higher than that of an obscure position (e. [sent-223, score-0.205]

60 An observed count of views placed in ov-top+1 thus has a greater normalized count than that in ov-bottom+2. [sent-226, score-0.194]

61 This normalization effectively asserts that, given a same observed (unnormalized) CTR, an ad shown in an inferior position has a higher click probability per se than the one placed in a more obvious position. [sent-227, score-0.673]

62 In online prediction, however, we need CTR estimates unbiased by positional effect in order for the matching ads to be ranked based on their qualities (clickabilities). [sent-230, score-0.311]

63 For an intuitive interpretation, if we scale positional priors so that the top position has a prior of 1, i. [sent-233, score-0.233]

64 , xpos ov-top+1 = 1, all ads are normalized to that top position. [sent-235, score-0.138]

65 Another view of the positional prior model we use is an examination model [25], that is, the probability of clicking on an ad is the product of a positional probability and a relevance-based probability which is independent of position. [sent-236, score-0.708]

66 This model is not dependent on the content of ads higher up on the search page, as for example the cascade [14] or DBN models [9]. [sent-238, score-0.174]

67 These models are appropriate 5 for search results where users have a high probability of clicking on one of the links. [sent-239, score-0.145]

68 However, for ads, the probability of clicking on ad links is extremely low, usually a fraction of a percent. [sent-240, score-0.306]

69 Thus the effects of higher ads is a product of factors which are extremely close to one. [sent-241, score-0.134]

70 In this case, the DBN positional prior reduces to a negative exponential function which is a good Ä? [sent-242, score-0.169]

71 Both summing terms (fij xkj /z ij and vij xkj ) only requires data that is available locally (in memory) right after the E-step for user j. [sent-258, score-0.718]

72 As thus arranged, an iteration of 3-10 E-steps combined with one M-step only requires a single pass over the user corpus. [sent-260, score-0.279]

73 Note that the inner loops of both E-step and M-step involve calculating the ratio fij /zij . [sent-263, score-0.165]

74 Since f is a count of very rare click events, one only needs to compute z when the corresponding f is non-zero. [sent-264, score-0.404]

75 Let Nc be the total number of non-zero f terms or distinct click events over all users. [sent-265, score-0.384]

76 For each non-zero fij , computing zij = ĂŽ›i xj dot-product takes d multiplications. [sent-266, score-0.209]

77 Suppose the entire Yahoo’s user base of SS contains about 200 million users. [sent-279, score-0.279]

78 Further assume 100 distinct ad views on average per user and an inner dimension of 10, thus the total number of operations is 4 Ä‚— 1010 for one iteration. [sent-281, score-0.682]

79 The dataset was obtained from 32 buckets of users and covering a one-month period, where the Ä? [sent-296, score-0.127]

80 Ĺš ltered out users with a total event count below 10, which gave 1. [sent-300, score-0.158]

81 2 across all latent topics for model training; and used a near-zero prior for positional prior estimation. [sent-306, score-0.298]

82 Ĺš ned as the ratio of the observed clicks to the expected clicks [9]), and (2) historical QL CTR normalized by position. [sent-308, score-0.286]

83 A click-view ROC curve plots the click recall vs. [sent-310, score-0.351]

84 82 or 2% improvement over historical QL CTR, and a 68% average CTR lift over Panama score at the 5-20% view recall range. [sent-316, score-0.228]

85 9 1 View recall (a) ROC plots (b) Pairwise CTR lift Figure 3: Model performance comparison among (1) GaP using QL-QT-LT, (2) Panama score predictor, and (3) historical QL-CTR predictor. [sent-335, score-0.164]

86 The number of clicks and views for a given ad are predicted separately and a CTR estimator is constructed as in Eq. [sent-348, score-0.449]

87 1 Prediction with different granularity We form the data matrix F from historical user behavioral data at the granular level, including click and view counts for individual ads, as well as other explanatory variable features such as page views. [sent-354, score-1.021]

88 Ĺš ciently by exploiting user data locality, particularly in the MapReduce [15] framework. [sent-366, score-0.279]

89 2 Experiments The data matrix F was formed to contain rows for all ad clicks and views, as well as page views with frequency above a threshold of 100. [sent-370, score-0.504]

90 The counts were aggregated over a two-week period of time and from 32 buckets of users. [sent-371, score-0.153]

91 This setup resulted in 170K features comprised of 120K ad clicks or views, and 50K page views, which allows the model matrix ĂŽ› to Ä? [sent-372, score-0.416]

92 We also measured the effect of the latent dimension on the model performance by varying d = 10 to 100, and observed that per-ad prediction is insensitive to the latent dimension with all ROC areas in the range of [95%, 96%], whereas per-category prediction beneÄ? [sent-384, score-0.338]

93 Finally, to verify the scalability of our parallel implementation, we increased the size of training data from 32 to 512 user buckets. [sent-386, score-0.305]

94 number of user buckets Number of buckets 32 64 128 512 Run-time (hours) 11. [sent-390, score-0.403]

95 3 (after accounting for the different factors of users 4Ä‚—, latent dimension 2Ä‚—, and EM iterations 13/15), with a similar 100 MÄ? [sent-396, score-0.206]

96 GaP is also a smoothing algorithm, which yields smoothed click prediction. [sent-406, score-0.351]

97 Moreover, GaP builds personalization into ad targeting, by proÄ? [sent-408, score-0.258]

98 GaP factorization with one inner dimension gives a statistically sound approach to estimating the positional prior. [sent-413, score-0.255]

99 Finally, the GaP-derived latent low-dimensional representation of user can be used as a valuable input to other applications and products, such as user clustering, collaborative Ä? [sent-414, score-0.651]

100 A dynamic Bayesian network click model for web search ranking. [sent-468, score-0.416]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ctr', 0.564), ('click', 0.351), ('user', 0.279), ('ad', 0.258), ('xkj', 0.205), ('gap', 0.199), ('positional', 0.169), ('fij', 0.132), ('ql', 0.11), ('ads', 0.11), ('clicks', 0.103), ('ss', 0.099), ('panama', 0.096), ('targeting', 0.093), ('latent', 0.093), ('views', 0.088), ('bt', 0.087), ('lift', 0.084), ('historical', 0.08), ('yij', 0.077), ('sponsored', 0.077), ('granular', 0.069), ('counts', 0.068), ('recurrence', 0.066), ('users', 0.065), ('position', 0.064), ('view', 0.064), ('roc', 0.062), ('advertising', 0.062), ('buckets', 0.062), ('poisson', 0.062), ('nv', 0.056), ('behavioral', 0.055), ('linead', 0.055), ('count', 0.053), ('prediction', 0.052), ('ik', 0.05), ('query', 0.05), ('clicking', 0.048), ('impression', 0.048), ('xj', 0.046), ('contextual', 0.041), ('cookie', 0.041), ('pavlov', 0.041), ('event', 0.04), ('ine', 0.04), ('yahoo', 0.038), ('www', 0.037), ('pro', 0.037), ('deployment', 0.036), ('hadoop', 0.036), ('mapreduce', 0.036), ('topics', 0.036), ('lda', 0.035), ('gamma', 0.034), ('predictor', 0.034), ('inner', 0.033), ('web', 0.033), ('topical', 0.033), ('events', 0.033), ('online', 0.032), ('content', 0.032), ('search', 0.032), ('patent', 0.031), ('zij', 0.031), ('locality', 0.03), ('arithmetic', 0.029), ('vij', 0.029), ('yielded', 0.029), ('factorization', 0.029), ('page', 0.029), ('updating', 0.028), ('interests', 0.028), ('parallelization', 0.028), ('em', 0.028), ('berkhin', 0.028), ('ctrs', 0.028), ('hashmap', 0.028), ('kapralov', 0.028), ('lineads', 0.028), ('qls', 0.028), ('qts', 0.028), ('xpos', 0.028), ('topic', 0.027), ('multiplicative', 0.027), ('matrix', 0.026), ('factor', 0.026), ('corpus', 0.026), ('nc', 0.026), ('scalability', 0.026), ('multiplication', 0.026), ('implementation', 0.025), ('dimension', 0.024), ('canny', 0.024), ('ops', 0.024), ('refresh', 0.024), ('factors', 0.024), ('predicting', 0.024), ('period', 0.023), ('xk', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 90 nips-2009-Factor Modeling for Advertisement Targeting

Author: Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

Abstract: We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. SpeciÄ?Ĺš cally, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo’s vast datasets, and show that our approach substantially outperform the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP is exceeding 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce, the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. 1

2 0.12040316 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1

3 0.10120936 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data

Author: Shuai Huang, Jing Li, Liang Sun, Jun Liu, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman, Jieping Ye

Abstract: Recent advances in neuroimaging techniques provide great potentials for effective diagnosis of Alzheimer’s disease (AD), the most common form of dementia. Previous studies have shown that AD is closely related to the alternation in the functional brain network, i.e., the functional connectivity among different brain regions. In this paper, we consider the problem of learning functional brain connectivity from neuroimaging, which holds great promise for identifying image-based markers used to distinguish Normal Controls (NC), patients with Mild Cognitive Impairment (MCI), and patients with AD. More specifically, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. In particular, we apply SICE to learn and analyze functional brain connectivity patterns from different subject groups, based on a key property of SICE, called the “monotone property” we established in this paper. Our experimental results on neuroimaging PET data of 42 AD, 116 MCI, and 67 NC subjects reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD. 1 In trod u cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. It is the most common form of dementia and currently affects over five million Americans; this number will grow to as many as 14 million by year 2050. The current knowledge about the cause of AD is very limited; clinical diagnosis is imprecise with definite diagnosis only possible by autopsy; also, there is currently no cure for AD, while most drugs only alleviate the symptoms. To tackle these challenging issues, the rapidly advancing neuroimaging techniques provide great potentials. These techniques, such as MRI, PET, and fMRI, produce data (images) of brain structure and function, making it possible to identify the difference between AD and normal brains. Recent studies have demonstrated that neuroimaging data provide more sensitive and consistent measures of AD onset and progression than conventional clinical assessment and neuropsychological tests [1]. Recent studies have found that AD is closely related to the alternation in the functional brain network, i.e., the functional connectivity among different brain regions [ 2]-[3]. Specifically, it has been shown that functional connectivity substantially decreases between the hippocampus and other regions of AD brains [3]-[4]. Also, some studies have found increased connectivity between the regions in the frontal lobe [ 6]-[7]. Learning functional brain connectivity from neuroimaging data holds great promise for identifying image-based markers used to distinguish among AD, MCI (Mild Cognitive Impairment), and normal aging. Note that MCI is a transition stage from normal aging to AD. Understanding and precise diagnosis of MCI have significant clinical value since it can serve as an early warning sign of AD. Despite all these, existing research in functional brain connectivity modeling suffers from limitations. A large body of functional connectivity modeling has been based on correlation analysis [2]-[3], [5]. However, correlation only captures pairwise information and fails to provide a complete account for the interaction of many (more than two) brain regions. Other multivariate statistical methods have also been used, such as Principle Component Analysis (PCA) [8], PCA-based Scaled Subprofile Model [9], Independent Component Analysis [10]-[11], and Partial Least Squares [12]-[13], which group brain regions into latent components. The brain regions within each component are believed to have strong connectivity, while the connectivity between components is weak. One major drawback of these methods is that the latent components may not correspond to any biological entities, causing difficulty in interpretation. In addition, graphical models have been used to study brain connectivity, such as structural equation models [14]-[15], dynamic causal models [16], and Granger causality. However, most of these approaches are confirmative, rather than exploratory, in the sense that they require a prior model of brain connectivity to begin with. This makes them inadequate for studying AD brain connectivity, because there is little prior knowledge about which regions should be involved and how they are connected. This makes exploratory models highly desirable. In this paper, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. Inverse covariance matrix has a clear interpretation that the off-diagonal elements correspond to partial correlations, i.e., the correlation between each pair of brain regions given all other regions. This provides a much better model for brain connectivity than simple correlation analysis which models each pair of regions without considering other regions. Also, imposing sparsity on the inverse covariance estimation ensures a reliable brain connectivity to be modeled with limited sample size, which is usually the case in AD studies since clinical samples are difficult to obtain. From a domain perspective, imposing sparsity is also valid because neurological findings have demonstrated that a brain region usually only directly interacts with a few other brain regions in neurological processes [ 2]-[3]. Various algorithms for achieving SICE have been developed in recent year [ 17]-[22]. In addition, SICE has been used in various applications [17], [21], [23]-[26]. In this paper, we apply SICE to learn functional brain connectivity from neuroimaging and analyze the difference among AD, MCI, and NC based on a key property of SICE, called the “monotone property” we established in this paper. Unlike the previous study which is based on a specific level of sparsity [26], the monotone property allows us to study the connectivity pattern using different levels of sparsity and obtain an order for the strength of connection between pairs of brain regions. In addition, we apply bootstrap hypothesis testing to assess the significance of the connection. Our experimental results on PET data of 42 AD, 116 MCI, and 67 NC subjects enrolled in the Alzheimer’s Disease Neuroimaging Initiative project reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD. 2 S ICE : B ack grou n d an d th e Mon oton e P rop erty An inverse covariance matrix can be represented graphically. If used to represent brain connectivity, the nodes are activated brain regions; existence of an arc between two nodes means that the two brain regions are closely related in the brain's functiona l process. Let be all the brain regions under study. We assume that follows a multivariate Gaussian distribution with mean and covariance matrix . Let be the inverse covariance matrix. Suppose we have samples (e.g., subjects with AD) for these brain regions. Note that we will only illustrate here the SICE for AD, whereas the SICE for MCI and NC can be achieved in a similar way. We can formulate the SICE into an optimization problem, i.e., (1) where is the sample covariance matrix; , , and denote the determinant, trace, and sum of the absolute values of all elements of a matrix, respectively. The part “ ” in (1) is the log-likelihood, whereas the part “ ” represents the “sparsity” of the inverse covariance matrix . (1) aims to achieve a tradeoff between the likelihood fit of the inverse covariance estimate and the sparsity. The tradeoff is controlled by , called the regularization parameter; larger will result in more sparse estimate for . The formulation in (1) follows the same line of the -norm regularization, which has been introduced into the least squares formulation to achieve model sparsity and the resulting model is called Lasso [27]. We employ the algorithm in [19] in this paper. Next, we show that with going from small to large, the resulting brain connectivity models have a monotone property. Before introducing the monotone property, the following definitions are needed. Definition: In the graphical representation of the inverse covariance, if node to by an arc, then is called a “neighbor” of . If is connected to chain of arcs, then is called a “connectivity component” of . is connected though some Intuitively, being neighbors means that two nodes (i.e., brain regions) are directly connected, whereas being connectivity components means that two brain regions are indirectly connected, i.e., the connection is mediated through other regions. In other words, not being connectivity components (i.e., two nodes completely separated in the graph) means that the two corresponding brain regions are completely independent of each other. Connectivity components have the following monotone property: Monotone property of SICE: Let components of with and and be the sets of all the connectivity , respectively. If , then . Intuitively, if two regions are connected (either directly or indirectly) at one level of sparseness ( ), they will be connected at all lower levels of sparseness ( ). Proof of the monotone property can be found in the supplementary file [29]. This monotone property can be used to identify how strongly connected each node (brain region) to its connectivity components. For example, assuming that and , this means that is more strongly connected to than . Thus, by changing from small to large, we can obtain an order for the strength of connection between pairs of brain regions. As will be shown in Section 3, this order is different among AD, MCI, and NC. 3 3.1 Ap p l i cati on i n B rai n Con n ecti vi ty M od el i n g of AD D a t a a c q u i s i t i o n a n d p re p ro c e s s i n g We apply SICE on FDG-PET images for 49 AD, 116 MCI, and 67 NC subjects downloaded from the ADNI website. We apply Automated Anatomical Labeling (AAL) [28] to extract data from each of the 116 anatomical volumes of interest (AVOI), and derived average of each AVOI for every subject. The AVOIs represent different regions of the whole brain. 3.2 B r a i n c o n n e c t i v i t y mo d e l i n g b y S I C E 42 AVOIs are selected for brain connectivity modeling, as they are considered to be potentially related to AD. These regions distribute in the frontal, parietal, occipital, and temporal lobes. Table 1 list of the names of the AVOIs with their corresponding lobes. The number before each AVOI is used to index the node in the connectivity models. We apply the SICE algorithm to learn one connectivity model for AD, one for MCI, and one for NC, for a given . With different ’s, the resulting connectivity models hold a monotone property, which can help obtain an order for the strength of connection between brain regions. To show the order clearly, we develop a tree-like plot in Fig. 1, which is for the AD group. To generate this plot, we start at a very small value (i.e., the right-most of the horizontal axis), which results in a fully-connected connectivity model. A fully-connected connectivity model is one that contains no region disconnected with the rest of the brain. Then, we decrease by small steps and record the order of the regions disconnected with the rest of the brain regions. Table 1: Names of the AVOIs for connectivity modeling (“L” means that the brain region is located at the left hemisphere; “R” means right hemisphere.) Frontal lobe Parietal lobe Occipital lobe Temporal lobe 1 Frontal_Sup_L 13 Parietal_Sup_L 21 Occipital_Sup_L 27 T emporal_Sup_L 2 Frontal_Sup_R 14 Parietal_Sup_R 22 Occipital_Sup_R 28 T emporal_Sup_R 3 Frontal_Mid_L 15 Parietal_Inf_L 23 Occipital_Mid_L 29 T emporal_Pole_Sup_L 4 Frontal_Mid_R 16 Parietal_Inf_R 24 Occipital_Mid_R 30 T emporal_Pole_Sup_R 5 Frontal_Sup_Medial_L 17 Precuneus_L 25 Occipital_Inf_L 31 T emporal_Mid_L 6 Frontal_Sup_Medial_R 18 Precuneus_R 26 Occipital_Inf_R 32 T emporal_Mid_R 7 Frontal_Mid_Orb_L 19 Cingulum_Post_L 33 T emporal_Pole_Mid_L 8 Frontal_Mid_Orb_R 20 Cingulum_Post_R 34 T emporal_Pole_Mid_R 9 Rectus_L 35 T emporal_Inf_L 8301 10 Rectus_R 36 T emporal_Inf_R 8302 11 Cingulum_Ant_L 37 Fusiform_L 12 Cingulum_Ant_R 38 Fusiform_R 39 Hippocampus_L 40 Hippocampus_R 41 ParaHippocampal_L 42 ParaHippocampal_R For example, in Fig. 1, as decreases below (but still above ), region “Tempora_Sup_L” is the first one becoming disconnected from the rest of the brain. As decreases below (but still above ), the rest of the brain further divides into three disconnected clusters, including the cluster of “Cingulum_Post_R” and “Cingulum_Post_L”, the cluster of “Fusiform_R” up to “Hippocampus_L”, and the cluster of the other regions. As continuously decreases, each current cluster will split into smaller clusters; eventually, when reaches a very large value, there will be no arc in the IC model, i.e., each region is now a cluster of itself and the split will stop. The sequence of the splitting gives an order for the strength of connection between brain regions. Specifically, the earlier (i.e., smaller ) a region or a cluster of regions becomes disconnected from the rest of the brain, the weaker it is connected with the rest of the brain. For example, in Fig. 1, it can be known that “Tempora_Sup_L” may be the weakest region in the brain network of AD; the second weakest ones are the cluster of “Cingulum_Post_R” and “Cingulum_Post_L”, and the cluster of “Fusiform_R” up to “Hippocampus_L”. It is very interesting to see that the weakest and second weakest brain regions in the brain network include “Cingulum_Post_R” and “Cingulum_Post_L” as well as regions all in the temporal lobe, all of which have been found to be affected by AD early and severely [3]-[5]. Next, to facilitate the comparison between AD and NC, a tree-like plot is also constructed for NC, as shown in Fig. 2. By comparing the plots for AD and NC, we can observe the following two distinct phenomena: First, in AD, between-lobe connectivity tends to be weaker than within-lobe connectivity. This can be seen from Fig. 1 which shows a clear pattern that the lobes become disconnected with each other before the regions within each lobe become disconnected with each other, as goes from small to large. This pattern does not show in Fig. 2 for NC. Second, the same brain regions in the left and right hemisphere are connected much weaker in AD than in NC. This can be seen from Fig. 2 for NC, in which the same brain regions in the left and right hemisphere are still connected even at a very large for NC. However, this pattern does not show in Fig. 1 for AD. Furthermore, a tree-like plot is also constructed for MCI (Fig. 3), and compared with the plots for AD and NC. In terms of the two phenomena discussed previously, MCI shows similar patterns to AD, but these patterns are not as distinct from NC as AD. Specifically, in terms of the first phenomenon, MCI also shows weaker between-lobe connectivity than within-lobe connectivity, which is similar to AD. However, the degree of weakerness is not as distinctive as AD. For example, a few regions in the temporal lobe of MCI, including “Temporal_Mid_R” and “Temporal_Sup_R”, appear to be more strongly connected with the occipital lobe than with other regions in the temporal lobe. In terms of the second phenomenon, MCI also shows weaker between-hemisphere connectivity in the same brain region than NC. However, the degree of weakerness is not as distinctive as AD. For example, several left-right pairs of the same brain regions are still connected even at a very large , such as “Rectus_R” and “Rectus_L”, “Frontal_Mid_Orb_R” and “Frontal_Mid_Orb _L”, “Parietal_Sup_R” and “Parietal_Sup_L”, as well as “Precuneus_R” and “Precuneus_L”. All above findings are consistent with the knowledge that MCI is a transition stage between normal aging and AD. Large λ λ3 λ2 λ1 Small λ Fig 1: Order for the strength of connection between brain regions of AD Large λ Small λ Fig 2: Order for the strength of connection between brain regions of NC Fig 3: Order for the strength of connection between brain regions of MCI Furthermore, we would like to compare how within-lobe and between-lobe connectivity is different across AD, MCI, and NC. To achieve this, we first learn one connectivity model for AD, one for MCI, and one for NC. We adjust the in the learning of each model such that the three models, corresponding to AD, MCI, and NC, respectively, will have the same total number of arcs. This is to “normalize” the models, so that the comparison will be more focused on how the arcs distribute differently across different models. By selecting different values for the total number of arcs, we can obtain models representing the brain connectivity at different levels of strength. Specifically, given a small value for the total number of arcs, only strong arcs will show up in the resulting connectivity model, so the model is a model of strong brain connectivity; when increasing the total number of arcs, mild arcs will also show up in the resulting connectivity model, so the model is a model of mild and strong brain connectivity. For example, Fig. 4 shows the connectivity models for AD, MCI, and NC with the total number of arcs equal to 50 (Fig. 4(a)), 120 (Fig. 4(b)), and 180 (Fig. 4(c)). In this paper, we use a “matrix” representation for the SICE of a connectivity model. In the matrix, each row represents one node and each column also represents one node. Please see Table 1 for the correspondence between the numbering of the nodes and the brain region each number represents. The matrix contains black and white cells: a black cell at the -th row, -th column of the matrix represents existence of an arc between nodes and in the SICE-based connectivity model, whereas a white cell represents absence of an arc. According to this definition, the total number of black cells in the matrix is equal to twice the total number of arcs in the SICE-based connectivity model. Moreover, on each matrix, four red cubes are used to highlight the brain regions in each of the four lobes; that is, from top-left to bottom-right, the red cubes highlight the frontal, parietal, occipital, and temporal lobes, respectively. The black cells inside each red cube reflect within-lobe connectivity, whereas the black cells outside the cubes reflect between-lobe connectivity. While the connectivity models in Fig. 4 clearly show some connectivity difference between AD, MCI, and NC, it is highly desirable to test if the observed difference is statistically significant. Therefore, we further perform a hypothesis testing and the results are summarized in Table 2. Specifically, a P-value is recorded in the sub-table if it is smaller than 0.1, such a P-value is further highlighted if it is even smaller than 0.05; a “---” indicates that the corresponding test is not significant (P-value>0.1). We can observe from Fig. 4 and Table 2: Within-lobe connectivity: The temporal lobe of AD has significantly less connectivity than NC. This is true across different strength levels (e.g., strong, mild, and weak) of the connectivity; in other words, even the connectivity between some strongly-connected brain regions in the temporal lobe may be disrupted by AD. In particular, it is clearly from Fig. 4(b) that the regions “Hippocampus” and “ParaHippocampal” (numbered by 39-42, located at the right-bottom corner of Fig. 4(b)) are much more separated from other regions in AD than in NC. The decrease in connectivity in the temporal lobe of AD, especially between the Hippocampus and other regions, has been extensively reported in the literature [3]-[5]. Furthermore, the temporal lobe of MCI does not show a significant decrease in connectivity, compared with NC. This may be because MCI does not disrupt the temporal lobe as badly as AD. AD MCI NC Fig 4(a): SICE-based brain connectivity models (total number of arcs equal to 50) AD MCI NC Fig 4(b): SICE-based brain connectivity models (total number of arcs equal to 120) AD MCI NC Fig 4(c): SICE-based brain connectivity models (total number of arcs equal to 180) The frontal lobe of AD has significantly more connectivity than NC, which is true across different strength levels of the connectivity. This has been interpreted as compensatory reallocation or recruitment of cognitive resources [6]-[7]. Because the regions in the frontal lobe are typically affected later in the course of AD (our data are early AD), the increased connectivity in the frontal lobe may help preserve some cognitive functions in AD patients. Furthermore, the frontal lobe of MCI does not show a significant increase in connectivity, compared with NC. This indicates that the compensatory effect in MCI brain may not be as strong as that in AD brains. Table 2: P-values from the statistical significance test of connectivity difference among AD, MCI, and NC (a) Total number of arcs = 50 (b) Total number of arcs = 120 (c) Total number of arcs = 180 There is no significant difference among AD, MCI, and NC in terms of the connectivity within the parietal lobe and within the occipital lobe. Another interesting finding is that all the P-values in the third sub-table of Table 2(a) are insignificant. This implies that distribution of the strong connectivity within and between lobes for MCI is very similar to NC; in other words, MCI has not been able to disrupt the strong connectivity among brain regions (it disrupts some mild and weak connectivity though). Between-lobe connectivity: In general, human brains tend to have less between-lobe connectivity than within-lobe connectivity. A majority of the strong connectivity occurs within lobes, but rarely between lobes. These can be clearly seen from Fig. 4 (especially Fig. 4(a)) in which there are much more black cells along the diagonal direction than the off-diagonal direction, regardless of AD, MCI, and NC. The connectivity between the parietal and occipital lobes of AD is significantly more than NC which is true especially for mild and weak connectivity. The increased connectivity between the parietal and occipital lobes of AD has been previously reported in [3]. It is also interpreted as a compensatory effect in [6]-[7]. Furthermore, MCI also shows increased connectivity between the parietal and occipital lobes, compared with NC, but the increase is not as significant as AD. While the connectivity between the frontal and occipital lobes shows little difference between AD and NC, such connectivity for MCI shows a significant decrease especially for mild and weak connectivity. Also, AD may have less temporal-occipital connectivity, less frontal-parietal connectivity, but more parietal-temporal connectivity than NC. Between-hemisphere connectivity: Recall that we have observed from the tree-like plots in Figs. 3 and 4 that the same brain regions in the left and right hemisphere are connected much weaker in AD than in NC. It is desirable to test if this observed difference is statistically significant. To achieve this, we test the statistical significance of the difference among AD, MCI, and NC, in term of the number of connected same-region left-right pairs. Results show that when the total number of arcs in the connectivity models is equal to 120 or 90, none of the tests is significant. However, when the total number of arcs is equal to 50, the P-values of the tests for “AD vs. NC”, “AD vs. MCI”, and “MCI vs. NC” are 0.009, 0.004, and 0.315, respectively. We further perform tests for the total number of arcs equal to 30 and find the P-values to be 0. 0055, 0.053, and 0.158, respectively. These results indicate that AD disrupts the strong connectivity between the same regions of the left and right hemispheres, whereas this disruption is not significant in MCI. 4 Con cl u si on In the paper, we applied SICE to model functional brain connectivity of AD, MCI, and NC based on PET neuroimaging data, and analyze the patterns based on the monotone property of SICE. Our findings were consistent with the previous literature and also showed some new aspects that may suggest further investigation in brain connectivity research in the future. R e f e re n c e s [1] S. Molchan. (2005) The Alzheimer's disease neuroimaging initiative. Business Briefing: US Neurology Review, pp.30-32, 2005. [2] C.J. Stam, B.F. Jones, G. Nolte, M. Breakspear, and P. Scheltens. (2007) Small-world networks and functional connectivity in Alzheimer’s disease. Cerebral Corter 17:92-99. [3] K. Supekar, V. Menon, D. Rubin, M. Musen, M.D. Greicius. (2008) Network Analysis of Intrinsic Functional Brain Connectivity in Alzheimer's Disease. PLoS Comput Biol 4(6) 1-11. [4] K. Wang, M. Liang, L. Wang, L. Tian, X. Zhang, K. Li and T. Jiang. (2007) Altered Functional Connectivity in Early Alzheimer’s Disease: A Resting-State fMRI Study, Human Brain Mapping 28, 967978. [5] N.P. Azari, S.I. Rapoport, C.L. Grady, M.B. Schapiro, J.A. Salerno, A. Gonzales-Aviles. (1992) Patterns of interregional correlations of cerebral glucose metabolic rates in patients with dementia of the Alzheimer type. Neurodegeneration 1: 101–111. [6] R.L. Gould, B.Arroyo, R,G. Brown, A.M. Owen, E.T. Bullmore and R.J. Howard. (2006) Brain Mechanisms of Successful Compensation during Learning in Alzheimer Disease, Neurology 67, 1011-1017. [7] Y. Stern. (2006) Cognitive Reserve and Alzheimer Disease, Alzheimer Disease Associated Disorder 20, 69-74. [8] K.J. Friston. (1994) Functional and effective connectivity: A synthesis. Human Brain Mapping 2, 56-78. [9] G. Alexander, J. Moeller. (1994) Application of the Scaled Subprofile model: a statistical approach to the analysis of functional patterns in neuropsychiatric disorders: A principal component approach to modeling regional patterns of brain function in disease. Human Brain Mapping, 79-94. [10] V.D. Calhoun, T. Adali, G.D. Pearlson, J.J. Pekar. (2001) Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Hum.Brain Mapp. 13, 43-53. [11] V.D. Calhoun, T. Adali, J.J. Pekar, G.D. Pearlson. (2003) Latency (in)sensitive ICA. Group independent component analysis of fMRI data in the temporal frequency domain. Neuroimage. 20, 1661-1669. [12] A.R. McIntosh, F.L. Bookstein, J.V. Haxby, C.L. Grady. (1996) Spatial pattern analysis of functional brain images using partial least squares. Neuroimage. 3, 143-157. [13] K.J. Worsley, J.B. Poline, K.J. Friston, A.C. Evans. (1997) Characterizing the response of PET and fMRI data using multivariate linear models. Neuroimage. 6, 305-319. [14] E. Bullmore, B. Horwitz, G. Honey, M. Brammer, S. Williams, T. Sharma. (2000) How good is good enough in path analysis of fMRI data? NeuroImage 11, 289–301. [15] A.R. McIntosh, C.L. Grady, L.G. Ungerieider, J.V. Haxby, S.I. Rapoport, B. Horwitz. (1994) Network analysis of cortical visual pathways mapped with PET. J. Neurosci. 14 (2), 655–666. [16] K.J. Friston, L. Harrison, W. Penny. (2003) Dynamic causal modelling. Neuroimage 19, 1273-1302. [17] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. (2008) Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research 9:485516. [18] J. Dahl, L. Vandenberghe, and V. Roycowdhury. (2008) Covariance selection for nonchordal graphs via chordal embedding. Optimization Methods Software 23(4):501-520. [19] J. Friedman, T.astie, and R. Tibsirani. (2007) Spares inverse covariance estimation with the graphical lasso, Biostatistics 8(1):1-10. [20] J.Z. Huang, N. Liu, M. Pourahmadi, and L. Liu. (2006) Covariance matrix selection and estimation via penalized normal likelihood. Biometrika, 93(1):85-98. [21] H. Li and J. Gui. (2005) Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2):302-317. [22] Y. Lin. (2007) Model selection and estimation in the gaussian graphical model. Biometrika 94(1)19-35, 2007. [23] A. Dobra, C. Hans, B. Jones, J.R. Nevins, G. Yao, and M. West. (2004) Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90(1):196-212. [24] A. Berge, A.C. Jensen, and A.H.S. Solberg. (2007) Sparse inverse covariance estimates for hyperspectral image classification, Geoscience and Remote Sensing, IEEE Transactions on, 45(5):1399-1407. [25] J.A. Bilmes. (2000) Factored sparse inverse covariance matrices. In ICASSP:1009-1012. [26] L. Sun and et al. (2009) Mining Brain Region Connectivity for Alzheimer's Disease Study via Sparse Inverse Covariance Estimation. In KDD: 1335-1344. [27] R. Tibshirani. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58(1):267-288. [28] N. Tzourio-Mazoyer and et al. (2002) Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single subject brain. Neuroimage 15:273-289. [29] Supplemental information for “Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data”. http://www.public.asu.edu/~jye02/Publications/AD-supplemental-NIPS09.pdf

4 0.08339718 24 nips-2009-Adapting to the Shifting Intent of Search Queries

Author: Umar Syed, Aleksandrs Slivkins, Nina Mishra

Abstract: Search engines today present results that are often oblivious to recent shifts in intent. For example, the meaning of the query ‘independence day’ shifts in early July to a US holiday and to a movie around the time of the box office release. While no studies exactly quantify the magnitude of intent-shifting traffic, studies suggest that news events, seasonal topics, pop culture, etc account for 1/2 the search queries. This paper shows that the signals a search engine receives can be used to both determine that a shift in intent happened, as well as find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases. 1

5 0.080373801 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

6 0.077791899 223 nips-2009-Sparse Metric Learning via Smooth Optimization

7 0.073320508 205 nips-2009-Rethinking LDA: Why Priors Matter

8 0.067953013 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

9 0.063634969 96 nips-2009-Filtering Abstract Senses From Image Search Results

10 0.056030516 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

11 0.054280464 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

12 0.05251452 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

13 0.050356146 190 nips-2009-Polynomial Semantic Indexing

14 0.049672328 256 nips-2009-Which graphical models are difficult to learn?

15 0.043820485 204 nips-2009-Replicated Softmax: an Undirected Topic Model

16 0.043251202 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

17 0.043125708 157 nips-2009-Multi-Label Prediction via Compressed Sensing

18 0.042983189 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

19 0.042905323 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

20 0.042098258 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.144), (1, -0.002), (2, -0.017), (3, -0.061), (4, 0.06), (5, -0.027), (6, -0.05), (7, -0.019), (8, -0.039), (9, 0.073), (10, -0.014), (11, -0.034), (12, 0.048), (13, 0.056), (14, -0.026), (15, 0.039), (16, -0.007), (17, 0.015), (18, -0.049), (19, -0.008), (20, -0.083), (21, 0.004), (22, -0.017), (23, -0.068), (24, -0.016), (25, 0.131), (26, 0.009), (27, -0.111), (28, 0.041), (29, 0.021), (30, -0.041), (31, 0.042), (32, 0.074), (33, -0.106), (34, -0.095), (35, -0.084), (36, -0.074), (37, 0.013), (38, -0.092), (39, -0.023), (40, -0.072), (41, -0.26), (42, 0.047), (43, 0.012), (44, 0.034), (45, -0.07), (46, 0.03), (47, 0.063), (48, 0.088), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92244554 90 nips-2009-Factor Modeling for Advertisement Targeting

Author: Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

Abstract: We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. SpeciÄ?Ĺš cally, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo’s vast datasets, and show that our approach substantially outperform the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP is exceeding 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce, the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. 1

2 0.55857921 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

Author: Dimitris Margaritis

Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. 1 Introduction and Motivation The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classifier: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable’s value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity. Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under a variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact and a more practical randomized approximate one. We use the observation (first made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, defined to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise definition is given later in Section 3) and present a family of algorithms for learning 1 the Markov boundary of a target variable in arbitrary domains. We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and definitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7. 2 Related Work Numerous algorithms have been proposed for feature selection. At the highest level algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output i.e., the resulting set of selected features. They therefore have typically wider applicability since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation e.g., decision tree learners. Early work was motivated by the problem of pattern recognition which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here—good surveys are included in Dash and Liu [6]; Guyon and Elisseeff [1]; Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8]; Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning of the structure of a Bayesian network of a domain, and extensions to it [11] including the recent MMMB algorithm [12]. Meinshausen and B¨ hlmann [13] u recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian i.e., linear relations among features with Gaussian noise. This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold. 3 Notation and Preliminaries In this section we present notation, fundamental definitions and axioms that will be subsequently used in the rest of the paper. We use the term “feature” and “variable” interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the “universe”) by U. All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, 2 e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations or mixed domains, as long as a reliable estimate of probabilistic independence is available. We denote probabilistic independence by the symbol “ ⊥ ” i.e., (X⊥ Y | Z) denotes the fact ⊥ ⊥ that the variables in set X are (jointly) conditionally independent from those in set Y given the values of the variables in set Z; (X⊥ Y | Z) denotes their conditional dependence. We assume ⊥ the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question “Is the set of variables in X independent of the variables in Y given the value of the variables in Z?” (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice however, such an oracle does not exist, but can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and studied extensively in the statistical literature over the last century; in this work we use the χ2 (chi-square) test of independence [16]. A Markov blanket of variable X is a set of variables such that, after fixing (by “knowing”) the value of all of its members, the set of remaining variables in the domain, taken together as a single setvalued variable, are statistically independent of X. More precisely, we have the following definition. Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X⊥ U − S − {X} | S). ⊥ Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: By completely “shielding” X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary. Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X i.e., none of its proper subsets is a Markov blanket. Pearl [14] proved that that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction. (Symmetry) (Decomposition) (Weak Union) (Contraction) (Intersection) (X⊥ Y | Z) =⇒ (Y⊥ X | Z) ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z) ∧ (X⊥ W | Z) ⊥ ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z ∪ W) ⊥ ⊥ (X⊥ Y | Z) ∧ (X⊥ W | Y ∪ Z) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (X⊥ Y | Z ∪ W) ∧ (X⊥ W | Z ∪ Y) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (1) The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but it holds in distributions that are positive, i.e., any value combination of the domain variables has a non-zero probability of occurring. 4 The Markov Boundary Theorem According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary definition. Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X}, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ (2) A detailed proof cannot be included here due to space constraints but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) set P1 that contains all subsets of U − S, and (b) set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X. Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of Markov blanket and its minimality i.e., the equation ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X⊥ T | S − T) ⊥ 3 Algorithm 1 The abstract GS(m) (X) algorithm. Returns an m-Markov boundary of X. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: S←∅ /* Growing phase. */ for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do if (X ⊥ Y | S) then ⊥ S←S∪Y goto line 3 /* Restart loop. */ /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 8 /* Restart loop. */ return S or, equivalently, (∀ T ⊆ U − S − {X}, (X⊥ T | S)) (as T and S are disjoint) corresponds to ⊥ the definition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X ⊥ T | S − T) . ⊥ This corresponds to the concept of minimality of the Markov boundary: It states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X⊥ T | S − T), ⊥ then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A). 5 A Family of Algorithms for Arbitrary Domains Theorem 1 defines conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition. In particular, an m-Markov boundary is defined as follows. Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary results in exactly Eq. (2) of Theorem 1. We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditional sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6. Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: For m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted 4 Algorithm 2 The RGS(m,k) (X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: S←∅ /* Growing phase. */ repeat Schanged ← false Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets if (X ⊥ Y | S) then ⊥ S←S∪Y Schanged ← true until Schanged = false /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 11 /* Restart loop. */ return S (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only i.e., Intersection is not needed, and thus it is widely applicable (to any domain). A Practical Randomized Anytime Version While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features as its asymptotic complexity is O(nm ). We therefore also we here provide a more practical randomized version called RGS(m,k) (Randomized GS(m) ), shown in Algorithm 2. The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of (U −S−{X}) (as GS(m) does), it instead samples k subsets from the set of all possible subsets of (U − S − {X}), where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi } (1 ≤ |Y| ≤ m) with probability proportional |Y| to i=1 (1/p(X, Yi | S)), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost. The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an an upper-bound of the m-boundary of X. Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state’s feature values. 6 Experiments We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We first systematically examined the performance on the task of recovering near-parity functions, which are known to be hard to learn [17]. We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn ) act as “dis5 GS(1) GS(3) RGS(1, 1000) 0 0.05 RGS(3, 1000) Relieved, threshold = 0.001 Relieved, threshold = 0.03 0.1 0.15 0.2 0.25 Noise probability 0.3 0.35 0.4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) GS 0.6 0.8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0.6, 1000 data points RELIEVED(threshold = 0.03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. noise level 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) average isolation measure GS 0.6 0.8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000) . tractors” and had randomly assigned probabilities i.e., the correct boundary of X1 is B1 = {X2 , X3 , X4 }. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. domain size sure, defined as the harmonic mean of precision and recall [18]. In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 was perturbed (inverted) by noise with 10% probability. As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% confidence intervals. With Figure 1: GS(m) and RGS(m,k) per(m,k) respect to execution time however RGS exhibits much formance with respect to domain greater scalability (Fig. 1 bottom, log scale); for example, it size (number of variables). Top: F1 executes in about 10 seconds on average in domains containmeasure, reflecting accuracy. Bot(m) ing 100 variables, while GS executes in 1,000 seconds on tom: Execution time in seconds (log average for this domain size. scale). We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m) , RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000 and RELIEVED for thresholds τ = 0.01 and 0.03 for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 1 0.9 GS(1) GS(2) GS(3) RGS(1, 1000) (2, 1000) RGS (3, 1000) RGS 0.8 F1-measure 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 10000 GS(1) GS(2) GS(3) Execution time (sec) 1000 RGS(1, 1000) RGS(2, 1000) RGS(3, 1000) 100 10 1 0.1 0.01 10 6 20 30 40 50 60 70 80 but appears to deteriorate for more noisy domains. As we can see it is difficult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. We also evaluated GS(m) , RGS(m,k) , and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ2 test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set. Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically significant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% significance level (α < 10−7 ). 7 Conclusion In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and evaluated them on on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability. Appendix A: Proof sketch of the Markov Boundary Theorem Proof sketch. (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not ⊥ contain X, (X ⊥ T′ | S − T′ ). Case (a) is immediate from the definition of the boundary and the ⊥ Decomposition theorem. Case (b) can be proven by contradiction: Assuming the independence of T′ that contains a non-empty part T′ in S and a part T′ in U − S, we get (from Decomposition) 1 2 (X⊥ T′ | S − T′ ). We can then use Contraction to show that the set S − T′ satisfies the inde⊥ 1 1 1 pendence property of a Markov boundary, i.e., that (X⊥ U − (S − T′ ) − {X} | S − T′ ), which ⊥ 1 1 contradicts the assumption that S is a boundary (and thus minimal). (⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 = ∅ i.e., S1 is a blanket strictly smaller than S. Then (X⊥ S2 | ⊥ S1 ) = (X⊥ S2 | S − S2 ). However, since S2 ⊆ U − S, from Eq. (2) we get (X ⊥ S2 | S − S2 ), ⊥ ⊥ which is a contradiction. Appendix B: Proof of m-Correctness of GS(m) Let the value of the set S at the end of the growing phase be SG , its value at the end of the shrinking phase SS , and their difference S∆ = SG − SS . The following two observations are immediate. Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X⊥ Y | SG ). ⊥ Observation 2. For every Y ∈ SS , (X ⊥ Y | SS − {Y }). ⊥ Lemma 2. Consider variables Y1 , Y2 , . . . , Yt for some t ≥ 1 and let Y = {Yj }t . Assuming that j=1 Contraction holds, if (X⊥ Yi | S − {Yj }i ) for all i = 1, . . . , t, then (X⊥ Y | S − Y). ⊥ ⊥ j=1 Proof. By induction on Yj , j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S t down to S − {Yj }i j=1 for all i = 1, 2, . . . , t. Since Y = {Yj }j=1 , we immediately obtain the desired relation (X⊥ Y | S − Y). ⊥ 7 Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS . Let S∆ = {Y1 , Y2 , . . . , Yt } be the set of variables removed (in that order) from SG to form the final set SS i.e., S∆ = SG − SS . Using the above lemma, the following is immediate. Corollary 3. Assuming that the Contraction axiom holds, (X⊥ S∆ | SS ). ⊥ Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥ T | SG ), ⊥ (X⊥ T ∪ (SG − SS ) | SS ). ⊥ (3) Furthermore SS is minimal i.e., there does not exist a subset of SS for which Eq. (3) is true. Proof. From Corollary 3, (X⊥ S∆ | SS ). Also, by the hypothesis, (X⊥ T | SG ) = (X⊥ T | ⊥ ⊥ ⊥ SS ∪ S∆ ), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X⊥ T ∪ S∆ | SS ). ⊥ To prove minimality, let us assume that SS = ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: Assume that there exists a set S′ ⊂ SS such that (X⊥ T ∪ (SG − S′ ) | S′ ). Let ⊥ W = SS − S′ = ∅. Note that W and S′ are disjoint. We have that SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′ ) =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′ ) = T ∪ (SG − S′ ) • Since (X⊥ T ∪ (SG − S′ ) | S′ ) and W ⊆ T ∪ (SS ∪ S∆ − S′ ), from Decomposition we ⊥ get (X⊥ W | S′ ). ⊥ • From (X⊥ W | S′ ) and Weak Union we have that for every Y ∈ W, (X⊥ Y | S′ ∪ ⊥ ⊥ (W − {Y })). • Since S′ and W are disjoint and since Y ∈ W, Y ∈ S′ . Applying the set equality (A − B) ∪ C = (A ∪ B) − (A − C) to S′ ∪ (W − {Y }) we obtain S′ ∪ W − ({Y } − S′ ) = SS − {Y }. • Therefore, ∀ Y ∈ W, (X⊥ Y | SS − {Y }). ⊥ However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS ) have been evaluated for independence and found dependent (Observation 2). Thus, since W = ∅, there exists at least one Y such that (X ⊥ Y | SS − {Y }), producing a contradiction. ⊥ Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X. Proof. We use the Markov Boundary Theorem. We first prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X⊥ T | SS − T) ⊥ or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X⊥ T | SS ). ⊥ Since U − SS − {X} = S∆ ∪ (U − SG − {X}), S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆ , (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′ , T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}. (i) From Corollary 3, (X⊥ S∆ | SS ). Therefore, from Decomposition, for any set T ⊆ S∆ , ⊥ (X⊥ T | SS ). ⊥ (ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X⊥ T | SG ). ⊥ By Lemma 4 we get (X⊥ T ∪ S∆ | SS ), from which we obtain (X⊥ T | SS ) by ⊥ ⊥ Decomposition. (iii) Since |T| ≤ m, we have that |T′′ | ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X⊥ T′′ | SG ). Therefore, by Lemma 4, (X⊥ T′′ ∪ S∆ | SS ). Since T′ ⊆ S∆ ⇒ ⊥ ⊥ T′′ ∪ T′ ⊆ T′′ ∪ S∆ , by Decomposition to obtain (X⊥ T′′ ∪ T′ | SS ) = (X⊥ T | SS ). ⊥ ⊥ To complete the proof we need to prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X ⊥ T | SS − T) . ⊥ Let T = T1 ∪ T2 , with T1 ⊆ SS and T2 ⊆ U − SS . Since T ⊆ U − SS , T1 contains at least one variable Y ∈ SS . From Observation 2, (X ⊥ Y | SS − {Y }). From this and (the contrapositive of) ⊥ Weak Union, we get (X ⊥ {Y } ∪ (T1 − {Y }) | SS − {Y } − (T1 − {Y })) = (X ⊥ T1 | SS − T1 ). ⊥ ⊥ From (the contrapositive of) Decomposition we get (X ⊥ T1 ∪ T2 | SS − T1 ) = (X ⊥ T | ⊥ ⊥ SS − T1 ), which is equal to (X ⊥ T | SS − T1 − T2 ) = (X ⊥ T | SS − T) as SS and T2 are ⊥ ⊥ disjoint. 8 References [1] Isabelle Guyon and Andr´ Elisseeff. An introduction to variable and feature selection. Journal e of Machine Learning Research, 3:1157–1182, 2003. [2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996. [3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977. [4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on the Americal Association for Artifical Intelligence (AAAI), 1991. [5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on the Americal Association for Artifical Intelligence (AAAI), pages 129–134, 1992. [6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3): 131–156, 1997. [7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998. [8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997. [9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97 (1-2):273–324, 1997. [10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000. [11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003. [12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003. [13] N. Meinshausen and P. B¨ hlmann. High-dimensional graphs and variable selection with the u Lasso. Annals of Statistics, 34:1436–1462, 2006. [14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. [15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. [16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990. [17] M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998. [18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979. 9

3 0.45703277 24 nips-2009-Adapting to the Shifting Intent of Search Queries

Author: Umar Syed, Aleksandrs Slivkins, Nina Mishra

Abstract: Search engines today present results that are often oblivious to recent shifts in intent. For example, the meaning of the query ‘independence day’ shifts in early July to a US holiday and to a movie around the time of the box office release. While no studies exactly quantify the magnitude of intent-shifting traffic, studies suggest that news events, seasonal topics, pop culture, etc account for 1/2 the search queries. This paper shows that the signals a search engine receives can be used to both determine that a shift in intent happened, as well as find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases. 1

4 0.44917315 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by definition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) finite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ fling and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9

5 0.4461503 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1

6 0.43267706 205 nips-2009-Rethinking LDA: Why Priors Matter

7 0.42344368 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data

8 0.40703151 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

9 0.38635173 223 nips-2009-Sparse Metric Learning via Smooth Optimization

10 0.35634628 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

11 0.34826186 226 nips-2009-Spatial Normalized Gamma Processes

12 0.34682235 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

13 0.33264017 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

14 0.33057931 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

15 0.31812719 3 nips-2009-AUC optimization and the two-sample problem

16 0.31433555 51 nips-2009-Clustering sequence sets for motif discovery

17 0.30965483 204 nips-2009-Replicated Softmax: an Undirected Topic Model

18 0.30528882 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

19 0.30443633 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

20 0.30115616 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.014), (24, 0.079), (25, 0.047), (31, 0.015), (35, 0.072), (36, 0.076), (39, 0.045), (58, 0.082), (71, 0.081), (81, 0.021), (84, 0.261), (86, 0.1), (91, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90113157 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

Author: Tomoharu Iwata, Takeshi Yamada, Naonori Ueda

Abstract: We propose a probabilistic topic model for analyzing and extracting contentrelated annotations from noisy annotated discrete data such as web pages stored in social bookmarking services. In these services, since users can attach annotations freely, some annotations do not describe the semantics of the content, thus they are noisy, i.e. not content-related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.

same-paper 2 0.80357498 90 nips-2009-Factor Modeling for Advertisement Targeting

Author: Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

Abstract: We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson) [6], to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem including data sparsity and locality. SpeciÄ?Ĺš cally, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo’s vast datasets, and show that our approach substantially outperform the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP is exceeding 0.95, while one prior approach using Poisson regression [11] yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce, the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. 1

3 0.79392642 74 nips-2009-Efficient Bregman Range Search

Author: Lawrence Cayton

Abstract: We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences. 1

4 0.69442624 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

Author: Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye

Abstract: We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.

5 0.60402638 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1

6 0.60052896 260 nips-2009-Zero-shot Learning with Semantic Output Codes

7 0.59879667 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

8 0.59509587 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

9 0.59328079 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

10 0.59272766 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

11 0.59041148 204 nips-2009-Replicated Softmax: an Undirected Topic Model

12 0.59036803 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

13 0.58856761 16 nips-2009-A Smoothed Approximate Linear Program

14 0.58807671 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

15 0.58794737 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

16 0.58788133 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

17 0.58727324 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

18 0.58724374 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

19 0.58682698 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

20 0.58550543 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization