nips2006-119: Learning to Rank with Nonsmooth Cost Functions (knowledge graph by maker-knowledge-mining)
Source: pdf
Authors: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth, multivariate cost function.
1 Introduction

In many inference tasks, the cost function¹ used to assess the final quality of the system is not the one used during training. For example, for classification tasks an error rate might be reported for a binary SVM classifier, even though the cost function used to train the SVM only very loosely models the number of errors on the training set; similarly, neural net training uses smooth costs, such as MSE or cross entropy. Thus, in machine learning tasks there are often actually two cost functions: the desired cost, and the one used in the optimization process. The optimization cost plays two roles: it is chosen to make the optimization task tractable (smooth, convex, etc.), and it should approximate the desired cost closely enough that optimizing it is worthwhile. This mismatch between target and optimization costs is not limited to classification tasks, and is particularly acute for information retrieval.
For example, [10] list nine target quality measures that are commonly used in information retrieval, all of which depend only on the sorted order of the documents² and their labeled relevance. The target costs are usually averaged over a large number of queries to arrive at a single cost that can be used to assess the algorithm. These target costs present severe challenges to machine learning: they are either flat (have zero gradient with respect to the model scores) or discontinuous, everywhere. It is very likely that a significant mismatch between the target and optimization costs will have a substantial adverse impact on the accuracy of the algorithm.

¹ Throughout this paper, we will use the terms “cost function” and “quality measure” interchangeably, with the understanding that the cost function is some monotonic decreasing function of the corresponding quality measure.
² For concreteness we will use the term “documents” for the items returned for a given query, although the returned items can be more general (e.g., any items that can be meaningfully ranked).
Perhaps the first approach that comes to mind would be to design smoothed versions of the cost function, but the inherent ‘sort’ makes this very challenging. The approach we propose instead is simple and very general: it can be used with any target cost function.
Notation: for the search problem, we denote the score of the ranking function by s_{ij}, where i = 1, …, N_Q indexes the query and j = 1, …, n_i indexes the documents returned for that query. The general cost function is denoted C({s_{ij}}, {l_{ij}}), where the curly braces denote sets of cardinality n_i, and where l_{ij} is the label of the j’th document returned for the i’th query, with j indexing the documents sorted by score. We will drop the query index i when the meaning is clear. Ranked lists are indexed from the top, which is convenient when list length varies; to conform with the notion that high rank means closer to the top of the list, we take “higher rank” to mean “lower rank index”.
2 Common Quality Measures Used in Information Retrieval

We list some commonly used quality measures for information retrieval tasks; see [10] and the references therein for details. Mean Reciprocal Rank (MRR) is a binary measure: if r_i is the rank of the highest-ranked relevant document for the i’th query, then the MRR is the reciprocal rank averaged over queries,

    MRR = (1/N_Q) Σ_{i=1}^{N_Q} 1/r_i.

Winner Takes All (WTA) is a binary measure for which, if the top-ranked document for a given query is relevant, the WTA cost is zero; otherwise it is one.
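The following Python sketch makes these two binary measures concrete; the function name and the per-query boolean-relevance input format are our own illustrative choices, not from the paper.

```python
def mrr_and_wta(ranked_relevance):
    """Mean Reciprocal Rank and mean WTA cost over a set of queries.

    `ranked_relevance` holds one list per query; entry j is True when the
    document at rank j + 1 is relevant (rank 1 is the top of the list).
    """
    reciprocal_ranks, wta_costs = [], []
    for rels in ranked_relevance:
        # Rank of the highest-ranked relevant document, or None if none exists.
        first = next((j + 1 for j, rel in enumerate(rels) if rel), None)
        reciprocal_ranks.append(0.0 if first is None else 1.0 / first)
        # WTA cost: zero iff the top-ranked document is relevant.
        wta_costs.append(0.0 if rels and rels[0] else 1.0)
    n_q = len(ranked_relevance)
    return sum(reciprocal_ranks) / n_q, sum(wta_costs) / n_q
```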
In fact, for binary classification tasks the pairwise correct is the same as the AUC, which has led to work exploring the optimization of the AUC using ranking algorithms [15, 3]. bpref biases the pairwise correct toward the top part of the ranking by choosing a subset of documents from which to compute the pairs [1, 10].
The Normalized Discounted Cumulative Gain (NDCG) is a cumulative, multilevel measure of ranking quality that is usually truncated at a particular rank level [6]. For a given query Q_i the NDCG is computed as

    𝒩_i ≡ N_i Σ_{j=1}^{L} (2^{r(j)} − 1) / log(1 + j)    (1)

where r(j) is the relevance level of the j’th document, and where the normalization constant N_i is chosen so that a perfect ordering would result in 𝒩_i = 1. Here L is the ranking truncation level at which the NDCG is computed. NDCG is particularly well suited to Web search applications because it is multilevel and because the truncation level can be chosen to reflect how many documents are shown to the user.
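A minimal Python sketch of Eq. (1); the function name and the convention that relevance labels arrive in ranked (score-sorted) order are our own.

```python
import math

def ndcg_at(relevances, L):
    """Truncated NDCG with gain 2^r - 1 and discount 1/log(1 + j), per Eq. (1).

    `relevances` lists the labels r(j) of the returned documents in ranked
    order (rank j = 1 first). The normalizer N_i is the reciprocal of the DCG
    of a perfect (descending-label) ordering, so a perfect ranking scores 1.
    The logarithm base cancels in the ratio, so natural log is fine.
    """
    def dcg(rels):
        return sum((2 ** r - 1) / math.log(1 + j)
                   for j, r in enumerate(rels[:L], start=1))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```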
3 Previous Work

The ranking task is the task of finding a sort on a set, and as such is related to the task of learning structured outputs. Like structured-output learning, the ranking problem maps outputs (documents) to the reals, but it solves a much simpler problem in that the number of documents to be sorted is tractable. Our focus is on a very different aspect of the problem, namely finding ways to directly optimize the cost that the user ultimately cares about. As in [7], we handle cost functions that are multivariate, in the sense that the number of documents returned for a given query can itself vary; but the key challenge we address in this paper is how to work with costs that are everywhere either flat or non-differentiable. Most cost functions used in machine learning are instead reducible (for example, MSE, cross entropy, log likelihood, and the costs commonly used in kernel methods). The ranking problem itself has attracted increasing attention recently (see, for example, [4, 2, 8]), and in this paper we will use the RankNet algorithm of [2] as a baseline, since it is both easy to implement and performs well on large retrieval tasks.
4 LambdaRank

One approach to working with a nonsmooth target cost function would be to search for an optimization function which is a good approximation to the target cost, but which is also smooth. However, the sort required by information retrieval cost functions makes this problematic. We illustrate the idea with an example which also demonstrates the perils introduced by a target/optimization cost mismatch. Let the target cost be WTA and let the chosen optimization cost be a smooth approximation to pairwise error.
Suppose that a ranking algorithm A is being trained, and that at some iteration, for a query for which there are only two relevant documents D_1 and D_2, A gives D_1 rank one and D_2 rank n. Then on this query A has WTA cost zero, but a pairwise error cost of n − 2. If the parameters of A are adjusted so that D_1 has rank two and D_2 rank three, then the WTA error is now maximized, but the number of pairwise errors has been reduced by n − 4. Now suppose that at the next iteration, D_1 is at rank two and D_2 at rank n ≫ 1. The change in D_1’s score required to move it to the top position is clearly less (possibly much less) than the change in D_2’s score required to move it to the top position.
If j_1 and j_2 are the rank indices of D_1 and D_2 respectively, then instead of pairwise error we would prefer an optimization cost C that has the property that

    |∂C/∂s_{j_1}| ≫ |∂C/∂s_{j_2}|    (2)

whenever j_2 ≫ j_1. By only having to specify rules for a given ordering, we are defining the gradients of an implicit cost function C only at the particular points in which we are interested. Let us write the gradient of C with respect to the score of the document at rank position j, for the i’th query, as

    ∂C/∂s_j = −λ_j(s_1, l_1, …, s_{n_i}, l_{n_i})    (3)

The sign is chosen so that positive λ_j means that the document must move up the ranked list to reduce the cost. Thus, in this framework, choosing an implicit cost function amounts to choosing suitable λ_j, which themselves are specified by rules that can depend on the ranked order (and scores) of all the documents.
Now, for a given query Q_i and corresponding set of returned documents D_{ij}, the n_i λ’s are functions of the scores s_{ij}, parameterized by the (fixed) labels l_{ij}. Let dx_j be a basis of 1-forms on R^n and define the 1-form

    λ ≡ Σ_j λ_j dx_j    (4)

Then, assuming that the scores are defined over R^n, the conditions for the Poincaré lemma are satisfied, and λ = dC for some function C if and only if dλ = 0 everywhere; that is, if and only if

    ∂λ_j/∂s_k = ∂λ_k/∂s_j   ∀ j, k ∈ {1, …, n_i}    (5)

This provides a simple test on the λ’s to determine whether there exists a cost function for which they are the derivatives: the Jacobian (that is, the matrix J_{jk} ≡ ∂λ_j/∂s_k) must be symmetric.
Furthermore, given that such a cost function C does exist, then since its Hessian is just the above Jacobian, the condition that C be convex is that the Jacobian be positive semidefinite everywhere. Under these constraints, the Jacobian looks rather like a kernel matrix, except that while an entry of a kernel matrix depends on two elements of a vector space, an entry of the Jacobian can depend on all of the scores s_j. Note that for constant λ’s the above two conditions are trivially satisfied, and that for other choices that give rise to a symmetric J, positive definiteness can be imposed by adding diagonal regularization terms of the form λ_j → λ_j + α_j s_j, α_j > 0.
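These two tests can also be checked numerically for a candidate λ rule, as in the sketch below. The function names and the finite-difference approach are our own, and since sort-based λ rules are only piecewise smooth, the check is valid only within a region where the sorted order does not change.

```python
import numpy as np

def lambda_jacobian(lambda_fn, scores, eps=1e-6):
    """Finite-difference Jacobian J[j, k] ~= d(lambda_j)/d(s_k) of a lambda rule.

    `lambda_fn` maps a score vector (np.ndarray) to the np.ndarray of lambdas,
    one per document in the current sorted order.
    """
    n = len(scores)
    J = np.zeros((n, n))
    base = lambda_fn(scores)
    for k in range(n):
        bumped = scores.astype(float)
        bumped[k] += eps
        J[:, k] = (lambda_fn(bumped) - base) / eps
    return J

def admits_cost_function(J, tol=1e-4):
    """Eq. (5) test: a cost C with dC/ds_j = -lambda_j exists iff J is
    symmetric; that cost is convex iff J is also positive semidefinite."""
    symmetric = np.allclose(J, J.T, atol=tol)
    psd = symmetric and np.all(np.linalg.eigvalsh((J + J.T) / 2) >= -tol)
    return symmetric, psd
```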
These conditions have a simple mechanical interpretation. Think of the documents returned for a given query as point masses. If the conditions of Eq. (5) are met, then the forces in the model are conservative; that is, they may be viewed as arising from a potential energy function, which in our case is the implicit cost function C. Similarly, if a contribution to a document A’s λ is computed based on its position with respect to document B, then B’s λ should be incremented by an equal and opposite amount, to prevent the pair itself from accelerating (Newton’s third law [9]). The method requires only that one provide rules for the derivatives of the implicit cost for any given sorted order of the documents, and, as we will show, such rules are easy to come up with.
5 Speeding Up RankNet Training

RankNet [2] is a neural network ranker trained on those pairs of feature vectors, for a given query, for which the corresponding documents have different labels. At run time, single feature vectors are forward-propagated (‘fpropped’) through the net, and the documents are ordered by the resulting scores. The RankNet cost consists of a sigmoid (to map the outputs to [0, 1]) followed by a pair-based cross-entropy cost, and takes the form given in Eq. (6); for a pair in which the first document is the more relevant, the familiar RankNet pair cost is C(s_i, s_j) = log(1 + e^{s_j − s_i}).
The ideas proposed in Section 4 suggest a simple method for significantly speeding up RankNet training, making it approximately linear in the number of labeled documents per query, rather than in the number of pairs per query. In fact the method works for any ranking algorithm that uses gradient descent and whose cost depends on pairs of items for each query. Here we use batch learning per query (that is, the weights are updated once per query). We present the idea for a general ranking function f : R^n → R with pairwise optimization cost C : R^2 → R. It is important to note that adopting batch training alone does not give a speedup: to compute the cost and its gradients we would still need to fprop each pair. Consider a single query for which n documents have been returned.
Let the output scores of the ranker be s_j, j = 1, …, n, let the model parameters be w_k ∈ R, and let the set of pairs of document indices used for training be P. Writing the total cost as C_T ≡ Σ_{(i,j)∈P} C(s_i, s_j), its derivative with respect to w_k splits into two terms by the chain rule; we can write the first term as

    ∂C_T/∂w_k = Σ_{i∈D} (∂s_i/∂w_k) Σ_{j∈P_i} ∂C(s_i, s_j)/∂s_i    (7)

and similarly for the second, where D is the set of document indices and P_i is the set of indices paired with document i. The algorithm is then as follows: instead of backpropping each pair, first n fprops are performed to compute the s_i (for the general LambdaRank algorithm, this is also where the sort on the scores is performed); then for each i = 1, …, n the quantities λ_i ≡ Σ_{j∈P_i} ∂C(s_i, s_j)/∂s_i are computed; then, to compute the gradients ∂s_i/∂w_k, n more fprops are performed; and finally the n backprops are done.
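A hedged Python sketch of this per-query procedure, using RankNet's pair cost for concreteness; `net.forward` and `net.backward` are placeholder methods for an arbitrary differentiable scorer, not an actual library API. The expensive network passes scale linearly in n, while the pair loop is cheap scalar arithmetic.

```python
import numpy as np

def train_query_fast(net, feats, pairs):
    """One batch weight update for a single query (the Section 5 speedup).

    `feats` holds the n feature vectors for this query; `pairs` holds index
    pairs (i, j) where document i is labeled more relevant than document j.
    `net.forward(x)` returns a scalar score; `net.backward(x, g)` is assumed
    to fprop x and accumulate g * ds/dw into the weight gradients.
    """
    n = len(feats)
    scores = np.array([net.forward(x) for x in feats])   # n fprops

    # Accumulate grad[i] = sum over pairs containing i of dC/ds_i, where
    # C(s_i, s_j) = log(1 + exp(s_j - s_i)) is the RankNet pair cost.
    # (The paper's lambda_i is the negative of this quantity.)
    grad = np.zeros(n)
    for i, j in pairs:
        g = 1.0 / (1.0 + np.exp(scores[i] - scores[j]))  # cheap scalar work
        grad[i] -= g                                     # dC/ds_i
        grad[j] += g                                     # dC/ds_j

    for i in range(n):                                   # n fprops + n backprops
        net.backward(feats[i], grad[i])
```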
6 Experiments

We performed experiments to (1) demonstrate the training speedup for RankNet, and (2) assess whether LambdaRank improves NDCG test performance. Even though the RankNet optimization cost is not NDCG, RankNet is still very effective at optimizing NDCG, using the method proposed in [2]: after each epoch, compute the NDCG on a validation set, and after training, choose the net for which the validation NDCG is highest. Rather than attempt to derive from first principles the optimal λ-function for the NDCG target cost (and for a given dataset), which is beyond the scope of this paper, we wrote several plausible λ-functions and tested them on the Web search data. We will refer to the original RankNet training as V1 and to the LambdaRank speedup as V2.
We ran two speedup experiments. In the first we used 1000 queries taken from the Web data described below, and in the second we varied the number of documents per query, using the artificial data described below. We compared V1 to V2 for one-layer and two-layer (with 10 hidden nodes) nets. V1 was also run using batch updates per query, to clearly show the gain (convergence as a function of epoch was found to be similar for batch and non-batch updates; furthermore, running time for batch and non-batch is almost identical). Thus V1 is close to quadratic in the number of documents per query (though not exactly, since only a subset of pairs is used, namely those whose documents’ labels differ), and V2 is close to linear, as expected.
Using the physical analogy, specifying a λ-function amounts to specifying rules for the ‘force’ on a document given its neighbors in the ranked list. All λ-functions were designed with the NDCG cost function in mind, and most had a margin built in (that is, a force is exerted between two documents even if they are in the correct order, until their difference in scores exceeds that margin). One such λ-function used the RankNet cost, scaled by the NDCG gain found by swapping the two documents in question. Thus for each pair, after the sort, we increment each document’s force by ±λ, where the more relevant document gets the positive increment.
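The sketch below shows one plausible reading of that rule in Python; it is our illustration, not necessarily the authors' exact λ-function (the paper tests several), and it omits the margin variant.

```python
import numpy as np

def ndcg_swap_lambdas(scores, labels, ideal_dcg):
    """Pairwise lambdas: RankNet-style gradients scaled by |delta NDCG| on swap.

    `scores` and `labels` are given in the current ranked (score-sorted)
    order, so array position p corresponds to rank j = p + 1. A positive
    lambda means the document should move up the list.
    """
    n = len(scores)
    discount = 1.0 / np.log(1.0 + np.arange(1, n + 1))   # 1 / log(1 + j)
    gain = (2.0 ** np.asarray(labels, dtype=float) - 1.0) / ideal_dcg
    lam = np.zeros(n)
    for a in range(n):
        for b in range(a + 1, n):
            if labels[a] == labels[b]:
                continue
            hi, lo = (a, b) if labels[a] > labels[b] else (b, a)
            # NDCG change if the two documents traded rank positions:
            swap = abs((gain[a] - gain[b]) * (discount[a] - discount[b]))
            # RankNet-style gradient magnitude for pushing `hi` above `lo`:
            g = swap / (1.0 + np.exp(scores[hi] - scores[lo]))
            lam[hi] += g    # more relevant document: positive (upward) force
            lam[lo] -= g    # equal and opposite (Newton's third law)
    return lam
```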
6.3 Ranking for Search Experiments

We performed experiments on three datasets: artificial data, Web search data, and intranet search data. The data are labeled from 0 to M in order of increasing relevance: the Web search and artificial data have M = 4, and the intranet search data has M = 3.
The learning rate was reduced whenever the tracked training quantity (the cost of Eq. (8) for RankNet, and the NDCG at truncation level 10 for LambdaRank) worsened relative to the value for the previous epoch. Training was run for 300 epochs for the artificial and Web search data and for 200 epochs for the intranet data, and training was restarted (with random weights) if the cost did not reduce for 50 iterations.
For the artificial data we used 50-dimensional feature vectors, 50 documents per query, and 10K/5K/10K queries for train/validation/test, respectively. We report the NDCG results in Figure 2 for ten NDCG truncation levels. For the two-layer nets the NDCG means are even closer. In Figure 3 we report the NDCG scores on the dataset at truncation levels from 1 to 10. We show separate plots to clearly show the differences: in fact, the linear LambdaRank results lie on top of the two-layer RankNet results for the larger truncation values.
[Figure caption fragment: right panel, two-layer nets.]

7 Conclusions

Since the Web search data set is large, the method furnishes a significant training speedup there.
We studied LambdaRank in the context of the NDCG target cost for neural network models, but the same ideas apply to any non-smooth target cost and to any differentiable function class.
simIndex simValue paperId paperTitle

same-paper 1 0.99999988 119 nips-2006-119 Learning to Rank with Nonsmooth Cost Functions (this paper; authors and abstract as above)
2 0.14496422 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-semantic indexing) or that only match documents at the specific word level (such as TF-IDF).
3 0.096028723 166 nips-2006-Recursive Attribute Factoring
Author: Deepak Verma, Karl Pfleger, David Tax
Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpus at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm.
4 0.082945809 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework.
5 0.082429022 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers.
6 0.066526175 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
7 0.064468808 129 nips-2006-Map-Reduce for Machine Learning on Multicore
8 0.058800183 116 nips-2006-Learning from Multiple Sources
9 0.057244062 50 nips-2006-Chained Boosting
10 0.056818459 88 nips-2006-Greedy Layer-Wise Training of Deep Networks
11 0.048011735 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
12 0.047604367 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
13 0.047580212 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
14 0.046078317 7 nips-2006-A Local Learning Approach for Clustering
15 0.045442466 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
16 0.045022476 47 nips-2006-Boosting Structured Prediction for Imitation Learning
17 0.044436168 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
18 0.043811481 66 nips-2006-Detecting Humans via Their Pose
19 0.043649886 130 nips-2006-Max-margin classification of incomplete data
20 0.042622603 75 nips-2006-Efficient sparse coding algorithms
topicId topicWeight
[(0, -0.157), (1, 0.016), (2, 0.009), (3, -0.049), (4, 0.025), (5, 0.036), (6, -0.034), (7, 0.017), (8, -0.021), (9, 0.016), (10, -0.088), (11, 0.032), (12, 0.005), (13, -0.027), (14, 0.005), (15, 0.029), (16, -0.047), (17, 0.016), (18, 0.162), (19, -0.139), (20, -0.019), (21, -0.074), (22, 0.093), (23, 0.094), (24, -0.152), (25, -0.017), (26, -0.119), (27, -0.108), (28, 0.11), (29, -0.148), (30, 0.038), (31, -0.159), (32, 0.024), (33, -0.015), (34, 0.067), (35, -0.1), (36, 0.134), (37, 0.068), (38, 0.002), (39, 0.007), (40, -0.034), (41, -0.07), (42, 0.109), (43, 0.156), (44, 0.08), (45, -0.006), (46, -0.053), (47, -0.063), (48, -0.127), (49, 0.103)]
simIndex simValue paperId paperTitle
same-paper 1 0.93367255 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. 1
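As a rough illustration of the implicit-cost idea, the sketch below computes per-document "lambda" gradients for one query by scaling a pairwise RankNet-style term with the NDCG change obtained from swapping two documents. This particular choice of lambda function is an assumption made for illustration and is not necessarily the exact rule studied in the paper.

import numpy as np

def dcg_gain(label, rank):
    return (2.0 ** label - 1.0) / np.log2(rank + 1.0)

def lambda_gradients(scores, labels):
    """Lambda gradients for one query: for each pair (i, j) with
    labels[i] > labels[j], push i up and j down with a force scaled
    by |delta NDCG| of swapping them."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(scores)
    order = np.argsort(-scores)
    ranks = np.empty(n, dtype=float)
    ranks[order] = np.arange(1, n + 1)
    ideal = dcg_gain(np.sort(labels)[::-1], np.arange(1, n + 1)).sum()
    lam = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where doc i should rank above doc j
            # |change in NDCG| if documents i and j swapped rank positions
            delta = abs(dcg_gain(labels[i], ranks[j]) + dcg_gain(labels[j], ranks[i])
                        - dcg_gain(labels[i], ranks[i]) - dcg_gain(labels[j], ranks[j])) / ideal
            rho = 1.0 / (1.0 + np.exp(scores[i] - scores[j]))  # pairwise force
            lam[i] += rho * delta   # push the more relevant document up
            lam[j] -= rho * delta   # push the less relevant document down
    return lam  # used directly as gradients on the model scores during training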
2 0.67182851 166 nips-2006-Recursive Attribute Factoring
Author: Deepak Verma, Karl Pfleger, David Tax
Abstract: Clustering, or factoring, of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than one built from contents or connections alone. Many problems arise, however, when trying to apply these joint models to corpora at the scale of the World Wide Web; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm. 1
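A minimal sketch in Python of the representational shift described above: instead of link features indexed by the identity of each linking page (billions of dimensions), summarize each document's inlinks by the attributes of the documents that link to it. The function signature and the averaging rule are illustrative assumptions.

import numpy as np

def attribute_link_features(content, inlinks):
    """content: (n_docs, n_attrs) attribute matrix (explicit or inferred).
    inlinks[d]: list of document indices linking to d.
    Returns content concatenated with averaged linker attributes."""
    n_docs, n_attrs = content.shape
    L = np.zeros((n_docs, n_attrs))
    for d, sources in enumerate(inlinks):
        if sources:
            L[d] = content[sources].mean(axis=0)
    return np.hstack([content, L])   # joint content + link representation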
3 0.66961092 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-semantic indexing) or that only match documents at the specific word level (such as TF-IDF).

1 Introduction and Motivation

Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al., 1990) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al., 1990; Hofmann, 1999; Canny, 2004; Buntine et al., 2005).

However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo.
The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo.

Word-based retrieval techniques, such as the widely-used term-frequency inverse-document-frequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory since it is not clear how to trade off generality and specificity in a principled way.

The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstraction and specific word matching. Daumé and Marcu (2006) independently proposed a probabilistic model using similar concepts for handling different training and test distributions in classification problems.

Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncrasies of specific individuals.

Section 2 reviews the basic principles of the LDA model and introduces the new SWB model.
Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and concluding comments.

2 A Topic Model for Special Words

Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topic-word multinomials represented by φ. In the generative model, for each document d, the Nd words are generated by drawing a topic t from the document-topic distribution p(z|θd) and then drawing a word w from the topic-word distribution p(w|z = t, φt).

[Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right).]

As shown in Griffiths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efficiently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples.

We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution specific for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-specific multinomial Ψ or a corpus-specific multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2) respectively. x is sampled from a document-specific multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors.

The conditional probability of a word w given a document d can be written as:

p(w \mid d) = p(x{=}0 \mid d) \sum_{t=1}^{T} p(w \mid z{=}t)\, p(z{=}t \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)

where p'(w|d) is the special word distribution for document d, and p''(w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways: via topics, via a special word distribution, or via a background word distribution.
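To make the generative story concrete, here is a minimal sketch in Python of sampling one document under SWB as described above; the function signature and parameter layout are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def generate_document(Nd, theta_d, phi, psi_d, omega, lam_d):
    """theta_d: (T,) topic mixture; phi: (T, W) topic-word multinomials;
    psi_d: (W,) special-word distribution; omega: (W,) background
    distribution; lam_d: (3,) switch probabilities for x = 0, 1, 2."""
    words = []
    for _ in range(Nd):
        x = rng.choice(3, p=lam_d)            # route: topic/special/background
        if x == 0:
            z = rng.choice(len(theta_d), p=theta_d)
            w = rng.choice(phi.shape[1], p=phi[z])
        elif x == 1:
            w = rng.choice(len(psi_d), p=psi_d)
        else:
            w = rng.choice(len(omega), p=omega)
        words.append(w)
    return words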
Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi. For xi = 0:

p(x_i = 0, z_i = t \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \alpha, \beta_0, \gamma) \propto \frac{N_{d0,-i} + \gamma}{N_{d,-i} + 3\gamma} \times \frac{C^{TD}_{td,-i} + \alpha}{\sum_{t'} C^{TD}_{t'd,-i} + T\alpha} \times \frac{C^{WT}_{wt,-i} + \beta_0}{\sum_{w'} C^{WT}_{w't,-i} + W\beta_0}

for xi = 1:

p(x_i = 1 \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \beta_1, \gamma) \propto \frac{N_{d1,-i} + \gamma}{N_{d,-i} + 3\gamma} \times \frac{C^{WD}_{wd,-i} + \beta_1}{\sum_{w'} C^{WD}_{w'd,-i} + W\beta_1}

and for xi = 2:

p(x_i = 2 \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \beta_2, \gamma) \propto \frac{N_{d2,-i} + \gamma}{N_{d,-i} + 3\gamma} \times \frac{C^{W}_{w,-i} + \beta_2}{\sum_{w'} C^{W}_{w',-i} + W\beta_2}

where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0, Nd1 and Nd2 are the number of words in document d assigned to the latent topics, special words and background component, respectively, C^{WT}_{wt}, C^{WD}_{wd} and C^{W}_{w} are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2, increases.

One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufficient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize.

We also pursued a variant of SWB, the special words (SW) model, that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-specific Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the final word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z.
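A minimal sketch in Python of one collapsed Gibbs update for a single token, implementing the three proportionality equations above; the count-array layout and names are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample_token(w, d, counts, alpha, beta0, beta1, beta2, gamma, T, W):
    """counts holds (with token i already decremented):
    Nd[d], Ndx[d, 0:3], Ctd: (T, D), Cwt: (W, T), Cwd: (W, D), Cw: (W,)."""
    Nd, Ndx, Ctd, Cwt, Cwd, Cw = counts
    p = np.zeros(T + 2)
    # route x = 0: one entry per topic t
    p[:T] = ((Ndx[d, 0] + gamma) / (Nd[d] + 3 * gamma)
             * (Ctd[:, d] + alpha) / (Ctd[:, d].sum() + T * alpha)
             * (Cwt[w, :] + beta0) / (Cwt.sum(axis=0) + W * beta0))
    # route x = 1: document-specific special-word distribution
    p[T] = ((Ndx[d, 1] + gamma) / (Nd[d] + 3 * gamma)
            * (Cwd[w, d] + beta1) / (Cwd[:, d].sum() + W * beta1))
    # route x = 2: corpus-wide background distribution
    p[T + 1] = ((Ndx[d, 2] + gamma) / (Nd[d] + 3 * gamma)
                * (Cw[w] + beta2) / (Cw.sum() + W * beta2))
    k = rng.choice(T + 2, p=p / p.sum())
    if k < T:
        return 0, k                       # (x, z) via the topic route
    return (1, None) if k == T else (2, None)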
3 Illustrative Examples

We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word “Enron.” The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model.

Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis.

[Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation.]

Figure 2(a) shows how intentionally misspelled words such as “biznesmen” and “beeznessmen” and rare words such as “pinkos” are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as “Snow” and the corporation name “CSX” that are specific to the document are likely to be assigned to the special topic. The words “Snow” and “CSX” do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as “special” words.

4 Experimental Results: Perplexity and Precision

We use 4 different document sets in our experiments, as summarized in Table 1.

Table 1: General characteristics of document data sets used in experiments.

Collection | # of Docs | Total # of Word Tokens | Median Doc Length | Mean Doc Length | # of Queries
NIPS       |  1740     |  2,301,375             | 1310              | 1322.6          | N/A
PATENTS    |  6711     | 15,014,099             | 1858              | 2237.2          | N/A
AP         | 10000     |  2,426,181             |  235.5            |  242.6          | 142
FR         |  2500     |  6,332,681             |  516              | 2533.1          |  30

The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online (from http://www.cs.toronto.edu/~roweis/data.html) and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included first and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were fit using T = 200 topics.

Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus.

Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora.

NIPS:    set .0206, number .0167, results .0153, case .0123, problem .0118, function .0108, values .0102, paper .0088, approach .0080, large .0079
PATENTS: fig .0647, end .0372, extend .0267, invent .0246, view .0214, shown .0191, claim .0189, side .0177, posit .0153, form .0128
AP:      nation .0147, sai .0129, presid .0118, polici .0108, issu .0096, call .0094, support .0085, need .0079, govern .0070, effort .0068
FR:      tagnum .0416, itag .0412, requir .0381, includ .0207, section .0189, determin .0134, part .0112, inform .0105, addit .0096, applic .0086

The ratios of words assigned to the special words distribution and the background distribution are, respectively for each data set, 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short “abstract-like” collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more specific details.

4.1 Perplexity Comparisons

The NIPS and PATENTS document sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents.
Perplexity of a test set under a model is defined as follows:

\mathrm{Perplexity}(\mathbf{w}_{\mathrm{test}} \mid D_{\mathrm{train}}) = \exp\left(-\frac{\sum_{d=1}^{D_{\mathrm{test}}} \log p(\mathbf{w}_d \mid D_{\mathrm{train}})}{\sum_{d=1}^{D_{\mathrm{test}}} N_d}\right)

where wtest is a vector of words in the test data set, wd is a vector of words in document d of the test set, and Dtrain is the training set. For the SWB model, we approximate p(wd|Dtrain) as follows:

p(\mathbf{w}_d \mid D_{\mathrm{train}}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{w}_d \mid \{\Theta^s, \Phi^s, \Psi^s, \Omega^s, \lambda^s\})

where Θ^s, Φ^s, Ψ^s, Ω^s and λ^s are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words wd in a test document d, given its parameters, can be computed as follows:

p(\mathbf{w}_d \mid \{\Theta^s, \Phi^s, \Psi^s, \Omega^s, \lambda^s\}) = \prod_{i=1}^{N_d}\left[\lambda^s_{1d} \sum_{t=1}^{T} \phi^s_{w_i t}\, \theta^s_{td} + \lambda^s_{2d}\, \psi^s_{w_i d} + \lambda^s_{3d}\, \Omega^s_{w_i}\right]

where Nd is the number of words in test document d and wi is the ith word being predicted in the test document. θ^s_{td}, φ^s_{w_i t}, ψ^s_{w_i d}, Ω^s_{w_i} and λ^s_d are point estimates from sample s. When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-specific parameters, θd, ψd and λd, and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents.

[Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right).]

From the perplexity figures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have significantly lower perplexity values than LDA, indicating that the SW and SWB models are using the special words “route” to better learn predictive models for individual documents.
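A minimal sketch in Python of the perplexity computation defined above, given point estimates from S Gibbs samples; the parameter-dictionary layout is an assumption, and a full implementation would average over samples in log space to avoid underflow on long documents.

import numpy as np

def doc_log_prob(wd, theta_d, phi, psi_d, omega, lam_d):
    """log p(w_d | {Theta, Phi, Psi, Omega, lambda}) for one Gibbs sample.
    wd: array of word ids; phi: (T, W); theta_d: (T,)."""
    topic = theta_d @ phi[:, wd]                 # (Nd,) topic-route term
    p = lam_d[0] * topic + lam_d[1] * psi_d[wd] + lam_d[2] * omega[wd]
    return np.log(p).sum()

def perplexity(test_docs, samples):
    """samples: list of per-sample dicts with keys theta (D, T),
    phi (T, W), psi (W, D), omega (W,), lam (D, 3)."""
    total_ll, total_tokens = 0.0, 0
    for d, wd in enumerate(test_docs):
        per_sample = [np.exp(doc_log_prob(np.asarray(wd), s["theta"][d],
                                          s["phi"], s["psi"][:, d],
                                          s["omega"], s["lam"][d]))
                      for s in samples]
        total_ll += np.log(np.mean(per_sample))  # average p(w_d) over samples
        total_tokens += len(wd)
    return np.exp(-total_ll / total_tokens)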
4.2 Information Retrieval Results

Returning to the point of capturing both specific and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected low-frequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how often specific query words occur in retrieved documents.

Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus).

Method | 1 Ret Doc | 10 Ret Docs | 50 Ret Docs | 100 Ret Docs
TF-IDF | 100.0     | 100.0       | 100.0       | 100.0
LSI    |  97.6     |  82.7       |  64.6       |  54.3
LDA    |  90.0     |  80.6       |  67.0       |  58.7
SW     |  99.2     |  97.1       |  79.1       |  67.3
SWB    |  99.4     |  96.6       |  78.7       |  67.2

TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more specific matches than either LDA or LSI, indicating that they have the ability to match at the level of specific words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next.

For the AP and FR document sets, 3 types of query sets were constructed from TREC Topics 1-150, based on the Title (short), Desc (sentence-length) and Concepts (long list of keywords) fields. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have

p(q \mid d) \approx \prod_{w \in q}\left[p(x{=}0 \mid d) \sum_{t=1}^{T} p(w \mid z{=}t)\, p(z{=}t \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w)\right]

We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TF-IDF score for a word w in a document d is computed as

\mathrm{TF\mbox{-}IDF}(w, d) = \frac{C^{WD}_{wd}}{N_d} \times \log_2 \frac{D}{D_w}

For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is first mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries.

To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric:

Figure 5: Information retrieval experimental results (* = significant difference with respect to LDA).

AP                MAP                        Pr@10d
Method   Title   Desc    Concepts   Title   Desc    Concepts
TF-IDF   .353    .358    .498       .406    .434    .549
LSI      .286    .387    .459       .455    .469    .523
LDA      .424    .394    .498       .478    .463    .556
SW       .466*   .430*   .550*      .524*   .509*   .599*
SWB      .460*   .417    .549*      .513*   .495    .603*

FR                MAP                        Pr@10d
Method   Title   Desc    Concepts   Title   Desc    Concepts
TF-IDF   .268    .272    .391       .300    .287    .483
LSI      .329    .295    .399       .366    .327    .487
LDA      .344    .271    .396       .428    .340    .487
SW       .371    .323*   .448*      .469    .407*   .550*
SWB      .373    .328*   .435       .462    .423*   .523

In all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical significance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not significant. In Figure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically significant difference (always an improvement) relative to the LDA model. The differences for the “non-starred” query and metric scores of SW and SWB are not statistically significant but nonetheless always favor SW and SWB over LDA.
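A minimal sketch in Python of query-likelihood scoring with the three-way SWB mixture above; the parameter names and shapes mirror the earlier sketches and are assumptions.

import numpy as np

def swb_query_score(query_ids, d, theta, phi, psi, omega, lam):
    """log p(q|d): product over query words of the three-way mixture.
    theta: (D, T); phi: (T, W); psi: (W, D); omega: (W,); lam: (D, 3)."""
    topic = theta[d] @ phi[:, query_ids]         # topic route per query word
    p = (lam[d, 0] * topic
         + lam[d, 1] * psi[query_ids, d]         # special-word route
         + lam[d, 2] * omega[query_ids])         # background route
    return np.log(p).sum()

def rank_documents(query_ids, params, n_docs, k=10):
    scores = np.array([swb_query_score(query_ids, d, *params)
                       for d in range(n_docs)])
    return np.argsort(-scores)[:k]               # top-k document indices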
5 Discussion and Conclusions

Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches.

In conclusion, we have proposed a new probabilistic model that accounts for both general and specific aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account specific aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner specific to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very specific query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize.

Acknowledgements

We thank Tom Griffiths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics.

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.

Buntine, W., Löfström, J., Perttu, S., and Valtonen, K. (2005) Topic-specific scoring of documents for relevant retrieval. Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany.

Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129.

Daumé III, H., and Marcu, D. (2006) Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26: 101-126.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.

Griffiths, T. L., and Steyvers, M. (2004) Finding scientific topics. Proceedings of the National Academy of Sciences, pp. 5228-5235.

Hofmann, T. (1999) Probabilistic latent semantic indexing. Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57.

Vogt, C., and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151-173.

Wei, X., and Croft, W. B. (2006) LDA-based document models for ad-hoc retrieval. Proceedings of the 29th SIGIR Conference, pp. 178-185.
4 0.46567696 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speedup. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speedup technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
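As a rough illustration of the "summation form" idea, the sketch below computes the sufficient statistics of ordinary least-squares regression as per-shard partial sums (the map phase) and adds them up (the reduce phase); the multiprocessing pool is an assumed stand-in for a real map-reduce runtime.

import numpy as np
from multiprocessing import Pool

def partial_stats(shard):
    X, y = shard                      # one shard of the training data
    return X.T @ X, X.T @ y           # summation-form sufficient statistics

def fit_ols(shards, n_workers=4):
    # map phase: each worker computes partial sums over its shard
    # (on some platforms Pool must be created under a __main__ guard)
    with Pool(n_workers) as pool:
        parts = pool.map(partial_stats, shards)
    # reduce phase: add the partial sums, then solve the normal equations
    A = sum(p[0] for p in parts)
    b = sum(p[1] for p in parts)
    return np.linalg.solve(A, b)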
5 0.4045592 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
Author: Sandeep Pandey, Christopher Olston
Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1
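A minimal sketch in Python of the explore/exploit tradeoff described above: an epsilon-greedy policy over ads with remaining budgets. The budget handling here is an assumed simplification for illustration, not the paper's algorithm or bounds.

import numpy as np

rng = np.random.default_rng(0)

def choose_ad(clicks, impressions, budgets, eps=0.1):
    """Pick an ad with remaining budget; explore with probability eps,
    otherwise exploit the highest estimated click-through rate."""
    eligible = np.flatnonzero(budgets > 0)
    if eligible.size == 0:
        return None                              # all budgets exhausted
    if rng.random() < eps:
        return rng.choice(eligible)              # explore an unknown ad
    ctr = clicks[eligible] / np.maximum(impressions[eligible], 1)
    return eligible[np.argmax(ctr)]              # exploit the best estimate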
6 0.37984166 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
7 0.37340888 156 nips-2006-Ordinal Regression by Extended Binary Classification
8 0.37298232 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
9 0.36965147 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
10 0.33473355 47 nips-2006-Boosting Structured Prediction for Imitation Learning
11 0.32690835 167 nips-2006-Recursive ICA
12 0.30155531 66 nips-2006-Detecting Humans via Their Pose
13 0.29617977 88 nips-2006-Greedy Layer-Wise Training of Deep Networks
14 0.27283761 186 nips-2006-Support Vector Machines on a Budget
15 0.27031776 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
16 0.26996031 5 nips-2006-A Kernel Method for the Two-Sample-Problem
17 0.2658127 169 nips-2006-Relational Learning with Gaussian Processes
18 0.25181928 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
19 0.24961503 139 nips-2006-Multi-dynamic Bayesian Networks
20 0.24434949 116 nips-2006-Learning from Multiple Sources
topicId topicWeight
[(1, 0.099), (3, 0.035), (7, 0.115), (9, 0.052), (20, 0.022), (22, 0.07), (44, 0.066), (57, 0.099), (65, 0.054), (66, 0.243), (69, 0.043), (71, 0.01)]
simIndex simValue paperId paperTitle
1 0.84834057 41 nips-2006-Bayesian Ensemble Learning
Author: Hugh A. Chipman, Edward I. George, Robert E. Mcculloch
Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy. 1
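A minimal sketch in Python of the backfitting intuition behind a sum-of-trees model: cycle through the trees, refitting each to the residual left by all the others. Real BART resamples tree structure and leaf values from a posterior via MCMC; the point-estimate refit below is an assumed simplification that only illustrates the backfitting loop.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def backfit_sum_of_trees(X, y, n_trees=50, max_depth=2, n_sweeps=10):
    # shallow trees keep each component weak, echoing the weak-learner prior
    trees = [DecisionTreeRegressor(max_depth=max_depth).fit(X, y / n_trees)
             for _ in range(n_trees)]
    preds = np.column_stack([t.predict(X) for t in trees])
    for _ in range(n_sweeps):
        for j, t in enumerate(trees):
            residual = y - preds.sum(axis=1) + preds[:, j]
            t.fit(X, residual)        # refit tree j to the current residual
            preds[:, j] = t.predict(X)
    return trees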
same-paper 2 0.80508882 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. 1
3 0.65478784 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
4 0.65162098 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
Author: Sridevi Parise, Max Welling
Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1
5 0.65161002 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possible different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning. 1
6 0.65062422 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
7 0.64918369 34 nips-2006-Approximate Correspondences in High Dimensions
8 0.64903635 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
9 0.64859116 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
10 0.64785445 158 nips-2006-PG-means: learning the number of clusters in data
11 0.64552695 110 nips-2006-Learning Dense 3D Correspondence
12 0.64498889 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
13 0.64489275 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
14 0.64433855 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
15 0.64420921 65 nips-2006-Denoising and Dimension Reduction in Feature Space
16 0.64381647 117 nips-2006-Learning on Graph with Laplacian Regularization
17 0.64347398 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
18 0.64338249 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
19 0.64242184 175 nips-2006-Simplifying Mixture Models through Function Approximation
20 0.64238185 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning