nips nips2006 nips2006-166 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Deepak Verma, Karl Pfleger, David Tax
Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpora at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. [sent-9, score-1.216]
2 Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. [sent-10, score-0.801]
3 Many problems arise when trying to apply these joint models to corpora at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. [sent-11, score-0.212]
4 We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. [sent-12, score-0.937]
5 Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. [sent-13, score-0.295]
6 1 Introduction There is a long and successful history of decomposing collections of documents into factors or clusters to identify “similar” documents and principal themes. [sent-15, score-0.484]
7 Collections have been factored on the basis of their textual contents [1, 2, 3], the connections between the documents [4, 5, 6], or both together [7]. [sent-16, score-0.373]
8 A factored corpus model is usually composed of a small number of “prototype” documents along with a set of mixing coefficients (one for each document in the corpus). [sent-17, score-0.748]
9 Each prototype corresponds to an abstract document whose features are, in some mathematical sense, “typical” of some subset of the corpus documents. [sent-18, score-0.404]
10 The mixing coefficients for a document d indicate how the model’s prototypes can best be combined to approximate d. [sent-19, score-0.37]
11 Many useful applications arise from factored models: • Model prototypes may be used as “topics” or cluster centers in spectral clustering [8] serving as “typical” documents for a class or cluster. [sent-20, score-0.448]
12 • Given a topic, factored models of link corpora allow identifying authoritative documents on that topic [4, 5, 6]. [sent-21, score-0.694]
13 • By exploiting correlations and “projecting out” uninformative terms, the space of a factored model’s mixing coefficients can provide a measure of semantic similarity between documents, regardless of the overlap in their actual terms [1]. [sent-22, score-0.181]
14 The remainder of this paper is organized as follows: Below, we first review the vector space model, formalize the factoring problem, and describe how factoring is applied to linked document collections. [sent-23, score-1.441]
15 The Vector Space Model: The vector space model is a convention for representing a document corpus (ordinarily a set of strings of arbitrary length) as a matrix, in which each document is represented as a column vector. [sent-26, score-0.689]
16 Let the number of documents in the corpus be N and the size of the vocabulary be M. [sent-27, score-0.351]
17 Then T denotes the M × N term-document matrix such that column j represents document dj , and Tij indicates the number of times term ti appears in document dj . [sent-28, score-0.722]
18 Geometrically, the columns of T can also be viewed as points in an M dimensional space, where each dimension i indexes the number of times term ti appears in the corresponding document. [sent-29, score-0.064]
19 A link-based corpus may also be represented as a vector space, defining an N × N matrix L where Lij = 1 if there is a link from document i to j and 0 otherwise. [sent-30, score-0.601]
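To make the two representations concrete, here is a minimal sketch (not taken from the paper) that builds the M × N term-document matrix T and the N × N link matrix L; the toy documents, vocabulary, and link list are invented purely for illustration.

```python
import numpy as np

# Toy corpus and link list, invented purely for illustration.
docs = [
    "web pages link to other web pages".split(),
    "factoring documents into topics and prototypes".split(),
    "pagerank ranks web pages using links".split(),
]
links = [(0, 2), (1, 2), (2, 0)]   # (source, target): document i links to document j

vocab = sorted({w for d in docs for w in d})
word_index = {w: i for i, w in enumerate(vocab)}
M, N = len(vocab), len(docs)

# T is M x N: column j represents document d_j, and T[i, j] counts term t_i in d_j.
T = np.zeros((M, N))
for j, doc in enumerate(docs):
    for w in doc:
        T[word_index[w], j] += 1

# L is N x N: L[i, j] = 1 if there is a link from document i to document j.
L = np.zeros((N, N))
for src, dst in links:
    L[src, dst] = 1.0

print(T.shape, L.shape)   # (M, N), (N, N)
```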
20 Factoring: Let A represent a matrix to be factored (usually T or T augmented with some other matrix) into K factors. [sent-32, score-0.148]
21 A ≈ U × V (1). In the geometric interpretation, columns of U contain the K prototypes, while columns of V indicate what mixture of prototypes best approximates the columns of the original matrix. [sent-34, score-0.223]
22 The definition of what constitutes a “best approximation” leads to the many different factoring algorithms in use today. [sent-35, score-0.571]
23 Figure 1: Factoring decomposes matrix A into matrices U and V. For the purposes of this paper, however, we are agnostic as to the factorization method used; our main concern is how A, the document matrix to be factored, is generated. [sent-37, score-0.356]
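Since the factorization method is left open, the sketch below uses plain non-negative multiplicative updates (in the style of Lee and Seung) as one concrete stand-in for A ≈ U × V; the toy matrix, iteration count, and random seeds are arbitrary choices, and any of LSA, PLSA, or NMF could be substituted.

```python
import numpy as np

def factorize(A, K, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative A (M x N) as U (M x K) @ V (K x N).

    Multiplicative updates for the Frobenius objective; any other notion of
    "best approximation" (LSA, PLSA, ...) could be swapped in here.
    """
    rng = np.random.default_rng(seed)
    M, N = A.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        U *= (A @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ A) / (U.T @ U @ V + eps)
    return U, V

# Stand-in for a term-document matrix; columns of U play the role of prototypes,
# columns of V the mixing coefficients for each document.
A = np.random.default_rng(1).poisson(0.5, size=(50, 30)).astype(float)
U, V = factorize(A, K=5)
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A))   # relative reconstruction error
```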
24 1 Factoring Text and Link Corpora When factoring a text corpus (e.g. [sent-39, score-0.711]
25 via LSA [1], PLSA [2], NMF [3] or some other technique), we directly factor the matrix T . [sent-41, score-0.069]
26 Columns of the resulting M × K matrix U are often interpreted as the K “principal topics” of the corpus, while columns of the K × N matrix V are “topic memberships” of the corpus documents. [sent-42, score-0.254]
27 (Footnote 1) In general, A ≈ f(U, V), where f can be any function that takes in the weights for a document and the document prototypes to generate the original vector. [sent-43, score-0.617]
28 When factoring a link corpus (e.g. via ACA [4] or PHITS [6]), we factor L or the normalized link matrix P. [sent-46, score-0.25]
29 Columns of the resulting N × K matrix U are often interpreted as the K “citation communities” of the corpus, and columns of the K × N matrix V indicate to what extent each document belongs to the corresponding community. [sent-47, score-0.402]
30 Additionally, U_ij, the degree of citation that community j accords to document di, can be interpreted as the “authority” of di in that community. [sent-48, score-0.392]
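A hedged sketch of this link-only case: factor a small random link matrix with scikit-learn's NMF (one arbitrary choice of factorizer) and read per-community document weights out of U, following the interpretation of U and V given above; the toy graph and the exact row/column conventions are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF   # one off-the-shelf A ~= U @ V factorizer

rng = np.random.default_rng(0)
N, K = 40, 3
L = (rng.random((N, N)) < 0.08).astype(float)   # toy link matrix, invented for illustration
np.fill_diagonal(L, 0.0)

model = NMF(n_components=K, init="random", random_state=0, max_iter=500)
U = model.fit_transform(L)        # N x K: per-document weight in each "citation community"
V = model.components_             # K x N: how much each document belongs to each community

for k in range(K):
    top = np.argsort(-U[:, k])[:5]
    print(f"community {k}: top documents by weight in U {top.tolist()}")
```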
31 Prior work [7] has demonstrated that building a single factored model of the joint term-link matrix produces a better model than that produced by using text or links alone. [sent-51, score-0.246]
32 The naive way to produce such a joint model is to append L or P below T, and factor the joint matrix: [ T ; L ] ≈ [ U_T ; U_L ] × V. [sent-52, score-0.174]
33 Column i of U_T indicates the expected term distribution of factor i, while the corresponding column of U_L indicates the distribution of documents that typically link to documents represented by that factor. [sent-54, score-0.716]
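A minimal sketch of the naive joint model on invented data: stack the link rows under the term rows (scaled by a free term/link weight standing in for the λ the text mentions) and factor once; the matrix sizes and the weight value are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
M, N, K = 60, 40, 4
T = rng.poisson(0.3, size=(M, N)).astype(float)    # toy term-document counts
L = (rng.random((N, N)) < 0.08).astype(float)       # toy link matrix
lam = 1.0                                           # relative weight of link vs. term evidence

A = np.vstack([T, lam * L])                         # (M + N) x N joint matrix

model = NMF(n_components=K, init="random", random_state=0, max_iter=500)
U = model.fit_transform(A)                          # (M + N) x K
V = model.components_                               # K x N

U_T, U_L = U[:M], U[M:]   # term block and inlinking-document block of each factor
print(U_T.shape, U_L.shape, V.shape)
```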
34 For clarity, we omit reference to λ in the equations below. (Figure 2: The naive joint model concatenates the term and link matrices.) [sent-56, score-0.1]
35 2 Beyond the Naive Joint Model Joint models provide a systematic way of incorporating information from both the terms and link structure present in a corpus. [sent-57, score-0.398]
36 But the naive approach described above does not scale up to web-sized corpora, which may have millions of terms and tens of billions of documents. [sent-58, score-0.109]
37 The matrix resulting from a naive representation of a web-scale problem would have N + M features with N ≈ 10^10 and M ≈ 10^6. [sent-59, score-0.112]
38 Simply representing this matrix (let alone factoring it) is impractical on a modern workstation. [sent-60, score-0.628]
39 The terms in a document are explicit attributes; links to the document provide additional attributes, represented (in the naive case) as the identities of the inlinking documents. [sent-62, score-0.921]
40 By taking a similar tack, we arrive at Attribute Factoring — the approach of representing link information in terms of the attributes of the inlinking documents, rather than by their explicit identities. [sent-64, score-0.727]
41 1 Attribute Factoring Each document dj , along with an attribute for each term, has an attribute for each other document di in the corpus, signifying the presence (or absence) of a link from di to dj . [sent-66, score-1.431]
42 When N ≈ 10^10, keeping each document identity as a separate attribute is prohibitive. [sent-67, score-0.528]
43 To create a more economical representation, we propose replacing the link attributes by a smaller set of attributes that “summarize” the information from link matrix L , possibly in combination with the term matrix T . [sent-68, score-0.998]
44 The most obvious attributes of a document are what terms it contains. [sent-69, score-0.557]
45 Therefore, one simple way to represent the “attributes” of a document’s inlinks is to aggregate the terms in the documents that link to it. [sent-70, score-0.499]
46 For computational and representational simplicity in this paper, however, we replace inlink identities with a sum of the terms in the inlinking documents. [sent-72, score-0.32]
47 [ T ; T × L ] ≈ [ U_T ; U_{T×L} ] × V. (2) Colloquially, we can look at this representation as saying that a document has “some distribution of terms” (T) and is linked to by documents that have “some other term distribution” (T × L). [sent-74, score-0.567]
48 By substituting the aggregated attributes of the inlinks for their identities, we can reduce the size of the representation down from (M +N )×N to a much more manageable 2M ×N . [sent-75, score-0.338]
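The corresponding sketch for Attribute Factoring, again on invented matrices: column j of T × L sums the term vectors of the documents that link to j, so the N link-identity rows of the naive representation are replaced by M aggregated-term rows.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
M, N, K = 60, 40, 4
T = rng.poisson(0.3, size=(M, N)).astype(float)
L = (rng.random((N, N)) < 0.08).astype(float)

A_naive = np.vstack([T, L])        # (M + N) x N: link identities kept explicitly
A_af    = np.vstack([T, T @ L])    # 2M x N: inlink identities replaced by aggregated inlink terms
print(A_naive.shape, A_af.shape)

model = NMF(n_components=K, init="random", random_state=0, max_iter=500)
U = model.fit_transform(A_af)      # 2M x K: term prototype stacked on inlink-term prototype
V = model.components_              # K x N mixing coefficients, exactly as before
print(U.shape, V.shape)
```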
49 What is surprising is that, on the domains tested, this more compact representation actually improves factoring performance. [sent-76, score-0.617]
50 2 Attribute Factoring Experiments We tested Attribute Factoring on two publicly available corpora of interlinked text documents. [sent-78, score-0.12]
51 For both datasets, we factored the content-only, naive joint, and AF joint representations using PLSA [2]. [sent-82, score-0.212]
52 We varied K, the number of computed factors, from 2 to 16, and performed 10 factoring runs for each value of K tested. [sent-83, score-0.571]
53 The factored models were evaluated by clustering each document to its dominant factor and measuring cluster precision: the fraction of documents in a cluster sharing the majority label. (Figure 4: Attribute Factoring outperforms the content-only and naive joint representations.) [sent-84, score-0.78]
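A sketch of this evaluation on invented data: assign each document to its dominant factor (the argmax of its column of V) and average the per-cluster precisions. Averaging the clusters without size weighting is an assumption here; the text defines the per-cluster quantity but not how the clusters are aggregated.

```python
import numpy as np

def cluster_precision(V, labels):
    """Cluster precision: per cluster, the fraction of documents sharing the
    majority label, averaged (unweighted) over the clusters."""
    assignments = np.argmax(V, axis=0)     # dominant factor for each document
    labels = np.asarray(labels)
    precisions = []
    for k in np.unique(assignments):
        members = labels[assignments == k]
        precisions.append(np.bincount(members).max() / len(members))
    return float(np.mean(precisions))

# Toy example: 8 documents, K = 2 factors, binary class labels (all invented).
V = np.array([[0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.1],
              [0.1, 0.2, 0.3, 0.8, 0.9, 0.7, 0.4, 0.9]])
labels = [0, 0, 0, 1, 1, 1, 0, 1]
print(cluster_precision(V, labels))   # 1.0 on this toy example
```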
54 Figure 4 illustrates a typical result: adding explicit link information improves cluster precision, but abstracting the link information with Attribute Factoring improves it even more. [sent-85, score-0.437]
55 3 Beyond Simple Attribute Factoring Attribute Factoring reduces the number of attributes from N +M to 2M , allowing existing factoring techniques to scale to web-sized corpora. [sent-86, score-0.844]
56 This reduction in the number of attributes, however, comes at a cost. [sent-87, score-0.273]
57 Since the identity of the document itself is replaced by its attributes, it is possible for unscrupulous authors (spammers) to “pose” as a legitimate page with high PageRank. [sent-88, score-0.329]
58 Consider the Yahoo! homepage, linked to by many pages, and linking to page RYL (Real Yahoo Link). [sent-91, score-0.117]
59 A link from the Yahoo! [sent-92, score-0.181]
60 homepage to RYL imparts a lot of authority and hence is highly desired by spammers. (Figure 5: Attribute Factoring can be “spammed” by mirroring one level back.) [sent-93, score-0.157]
61 Failing that, a spammer might try to create a counterfeit copy of the Yahoo! [sent-94, score-0.112]
62 homepage, boost its PageRank by means of a “link farm”, and create a link from it to his page FYL (Fake Yahoo Link). [sent-95, score-0.212]
63 Without link information, our factoring cannot distinguish the counterfeit homepage from the real one. [sent-96, score-0.913]
64 Using AF or the naive joint model allows us to distinguish them based on the distribution of documents that link to each. [sent-97, score-0.514]
65 But with AF, that real/counterfeit distinction is not propagated to documents that they point to. [sent-98, score-0.253]
66 1 Recursive Attribute Factoring Spamming AF was simple because it only looks one link behind. [sent-102, score-0.181]
67 That is, attributes for a document are either explicit terms in that document or explicit terms in documents linking to the current document. [sent-103, score-1.241]
68 This let us infer that the fake Yahoo! homepage was counterfeit, but provided no way to propagate this inference on to later pages. [sent-105, score-0.129]
69 It makes inferences about a document based on explicit attributes propagated from the documents linking to it, but this inference only propagates one level. [sent-107, score-0.902]
70 For example, it lets us infer that the fake Yahoo! [sent-108, score-0.064]
71 homepage was counterfeit, but provides no way to propagate this inference on to later pages. [sent-109, score-0.129]
72 This suggests that we need to propagate not only the explicit attributes of a document (its component terms) but its inferred attributes as well. [sent-110, score-0.964]
73 A ready source of inferred attributes comes from the factoring process itself. [sent-111, score-0.939]
74 Recall that when factoring T ≈ U × V, if we interpret the columns of U as factors or prototypes, then each column of V can be interpreted as the inferred factor memberships of its corresponding document. (Figure 6: Recursive Attribute Factoring aggregates the inferred attributes (columns of V) of inlinking documents.) [sent-112, score-0.88]
75 Therefore, we can propagate the inferred attributes of inlinking documents by aggregating the columns of V they correspond to (Figure 6). [sent-113, score-1.057]
76 Numerically, this replaces T (the explicit document attributes) in the bottom half of the left matrix with V (the inferred document attributes): [ T ; V × L ] ≈ [ U_T ; U_{V×L} ] × V. (3) [sent-114, score-0.72]
77 There are some worrying aspects of this representation: the document representation is no longer statically defined, and the equation itself is recursive. [sent-115, score-0.303]
78 The “inferred” attributes (IA ) are set initially to random values, which are then updated until they converge. [sent-117, score-0.273]
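One plausible way to run this recursion, sketched on toy data: initialize IA at random, alternate a few multiplicative factorization updates on the stacked matrix [T ; IA] with the refresh IA = V × P (using the outdegree-normalized P, following footnote 2), and stop once IA stabilizes. The alternation schedule, inner update count, and tolerance are assumptions; the text only says IA starts random and is updated until convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 60, 40, 4
T = rng.poisson(0.3, size=(M, N)).astype(float)
L = (rng.random((N, N)) < 0.08).astype(float)

# Outdegree-normalized link matrix P (normalizing L helps the recursion converge).
out_deg = L.sum(axis=1, keepdims=True)
P = np.divide(L, out_deg, out=np.zeros_like(L), where=out_deg > 0)

eps = 1e-9
U = rng.random((M + K, K))     # stacked [U_T ; U_IA]
V = rng.random((K, N))         # inferred factor memberships
IA = rng.random((K, N))        # inferred attributes, random initialization

for outer in range(50):
    A = np.vstack([T, IA])                       # (M + K) x N recursive representation
    for _ in range(10):                          # a few inner NMF multiplicative updates
        U *= (A @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ A) / (U.T @ U @ V + eps)
    new_IA = V @ P                               # aggregate inferred attributes of inlinking documents
    if np.linalg.norm(new_IA - IA) < 1e-4 * (np.linalg.norm(IA) + eps):
        break
    IA = new_IA

print(U[:M].shape, V.shape)    # U_T (M x K) and the inferred memberships V (K x N)
```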
79 It is perhaps not surprising that RAF by itself does not perform as well as AF on the domains tested (see footnote 3). (Footnote 2: We use L and P interchangeably to represent the contribution from inlinking documents, distinguishing only in the case of “recursive” equations, where it is important to normalize L to facilitate convergence.) [sent-124, score-0.439]
80 When available, explicit information is arguably more powerful than inferred information. [sent-125, score-0.152]
81 It’s important to realize, however, that AF and RAF are in no way exclusive of each other; when we combine the two and propagate both explicit and implicit attributes, our performance is (satisfyingly) better than with either alone (top lines in Figures 7(a) and (b)). [sent-126, score-0.089]
82 In general though, we can set IA to be any matrix that aggregates attributes of a document’s inlinks. [sent-129, score-0.331]
83 (Footnote 4) For AF we can replace the N-dimensional inlink vector with an M-dimensional inferred vector d_i such that d_i = Σ_{j : L_ji = 1} w_j d_j, and then IA would be the matrix with the inferred attributes for each document i. [sent-130, score-0.948]
84 Different choices for w_j lead to different weightings of the aggregated attributes from the incoming documents; some variations are summarized in Table 1. [sent-133, score-0.328]
85 Summed function                             w_j          IA
   Attribute Factoring                         1            T × L
   Outdegree-normalized Attribute Factoring    P_ji         T × P
   PageRank-weighted Attribute Factoring       P_j          T × diag(P) × L
   PageRank- and outdegree-normalized          P_j P_ji     T × diag(P) × P
Table 1: Variations on attribute weighting for Attribute Factoring. (P_j is the PageRank of document j.) [sent-134, score-0.288]
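The four IA variants of Table 1 on a toy graph. The sketch reads diag(P) in the table as a diagonal matrix of per-document PageRank scores (the P_j of the caption) and computes PageRank with a plain power iteration; the damping factor and iteration count are conventional implementation choices, not something the table specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 60, 40
T = rng.poisson(0.3, size=(M, N)).astype(float)    # toy term-document counts
L = (rng.random((N, N)) < 0.08).astype(float)       # toy link matrix, L[i, j] = 1 if i links to j

# Outdegree-normalized link matrix P.
out_deg = L.sum(axis=1, keepdims=True)
P = np.divide(L, out_deg, out=np.zeros_like(L), where=out_deg > 0)

def pagerank(P, damping=0.85, n_iter=100):
    """Per-document PageRank scores by power iteration over the normalized links."""
    n = P.shape[0]
    pr = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        pr = damping * (pr @ P) + (1.0 - damping) / n
        pr /= pr.sum()
    return pr

pr = pagerank(P)

IA_af        = T @ L                      # Attribute Factoring
IA_outdeg    = T @ P                      # outdegree-normalized
IA_pr        = T @ np.diag(pr) @ L        # PageRank-weighted
IA_pr_outdeg = T @ np.diag(pr) @ P        # PageRank- and outdegree-normalized

for name, ia in [("AF", IA_af), ("outdegree", IA_outdeg),
                 ("pagerank", IA_pr), ("pagerank+outdegree", IA_pr_outdeg)]:
    print(name, ia.shape)                 # each variant is an M x N attribute block
```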
86 (Footnote 3) It is somewhat surprising (and disappointing) that RAF performs worse than the content-only model, but other work [7] has posited situations when this may be expected. [sent-135, score-0.295]
87 While useful in conjunction with ordinary Attribute Factoring, its recursive nature and lack of convergence guarantees are troubling. [sent-139, score-0.104]
88 One way to simulate the desired effect of RAF in closed form is to explicitly model the inlink attributes beyond just one level. [sent-140, score-0.321]
89 For example, ordinary AF looks back one level at the (explicit) attributes of inlinking documents by setting IA = T × L. [sent-141, score-0.701]
90 The IA matrix would have 2M features (M attributes for inlinking documents and another M for attributes of documents that linked to the inlinking documents). [sent-143, score-1.435]
91 It should be pointed out that these extended attributes rapidly converge to the stationary distribution of terms on the web: T × L^∞ = T × eig(L), equivalent to weighting inlinking attributes by a version of PageRank that omits random restarts. [sent-146, score-0.767]
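A small numerical illustration of the extended attributes and of the limiting behaviour claimed above, on an invented graph: stack one-step and two-step inlink term aggregates, then check that aggregating over ever longer link chains stops changing direction. The outdegree normalization in the second half is added only to keep the power iteration well scaled.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 60, 40
T = rng.poisson(0.3, size=(M, N)).astype(float)
L = (rng.random((N, N)) < 0.1).astype(float)

# Extended Attribute Factoring: look back two levels instead of one.
IA_extended = np.vstack([T @ L, T @ L @ L])    # 2M attribute rows (one- and two-step inlinks)
print(IA_extended.shape)

# Limiting behaviour: repeated propagation converges (in direction) to a stationary
# weighting of inlinking documents, i.e. a restart-free PageRank-style weighting.
out_deg = L.sum(axis=1, keepdims=True)
P = np.divide(L, out_deg, out=np.zeros_like(L), where=out_deg > 0)

Xk  = np.linalg.matrix_power(P, 30)
Xk1 = Xk @ P
a = T @ Xk
b = T @ Xk1
a /= np.linalg.norm(a) + 1e-12
b /= np.linalg.norm(b) + 1e-12
print(np.linalg.norm(a - b))   # small: the multi-level attributes have effectively converged
```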
92 A smoothed version of the recursive equation can be written as [ T ; γ · V × P ] ≈ [ U_T ; U_{V×L} ] × V. [sent-150, score-0.11]
93 Starting from the original equation, let us first remove the explicit attributes. [sent-153, score-0.077]
94 This means that, in the absence of T ’s term data, the inferred attributes V produced by smoothed RAF represent a sort of generalized, multi-dimensional PageRank, where each dimension corresponds to authority on one of the inferred topics of the corpus. [sent-157, score-0.579]
95 With the terms of T added, the intuition is that V and the inferred attributes IA = V × P converge to a trade-off between the generalized PageRank of the link structure and the factor values for T in terms of the prototypes U_T capturing term information. [sent-158, score-0.721]
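For reference, a sketch of the one-dimensional object being generalized: ordinary PageRank by power iteration on a toy graph. The damping factor and iteration count are conventional choices; the claim above is that, with the term data removed, the inferred attributes of smoothed RAF behave like a K-dimensional analogue of this score, one dimension per inferred topic.

```python
import numpy as np

def pagerank(L, damping=0.85, n_iter=100):
    """Ordinary PageRank by power iteration over the outdegree-normalized link matrix."""
    n = L.shape[0]
    out_deg = L.sum(axis=1, keepdims=True)
    P = np.divide(L, out_deg, out=np.zeros_like(L), where=out_deg > 0)
    pr = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        pr = damping * (pr @ P) + (1.0 - damping) / n   # random-restart smoothing
        pr /= pr.sum()
    return pr

rng = np.random.default_rng(0)
L = (rng.random((40, 40)) < 0.1).astype(float)
pr = pagerank(L)
print(pr.argsort()[::-1][:5])   # indices of the five highest-PageRank documents in the toy graph
```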
96 5 Summary We have described a representational methodology for factoring web-scale corpora, incorporating both content and link information. [sent-159, score-0.815]
97 The main idea is to represent link information with attributes of the inlinking documents rather than their explicit identities. [sent-160, score-0.921]
98 We have no principled basis for weighting the different kinds of attributes in AF and EAF; while RAF seems to converge reliably in practice, we have no theoretical guarantees that it will always do so. [sent-163, score-0.299]
99 Finally, in spite of our motivating example being the ability to factor very large corpora, we have only tested our algorithms on small “academic” data sets; applying AF, RAF and EAF to a web-scale corpus remains the real (and as yet untried) criterion for success. [sent-164, score-0.151]
100 The missing link - a probabilistic model of document content and hypertext connectivity. [sent-204, score-0.476]
wordName wordTfidf (topN-words)
[('factoring', 0.571), ('attributes', 0.273), ('document', 0.266), ('attribute', 0.262), ('documents', 0.233), ('raf', 0.225), ('ia', 0.21), ('link', 0.181), ('inlinking', 0.177), ('pagerank', 0.145), ('corpus', 0.118), ('af', 0.112), ('factored', 0.112), ('corpora', 0.098), ('homepage', 0.097), ('inferred', 0.095), ('recursive', 0.086), ('prototypes', 0.085), ('yahoo', 0.085), ('counterfeit', 0.064), ('fake', 0.064), ('ut', 0.064), ('dj', 0.059), ('naive', 0.059), ('explicit', 0.057), ('linking', 0.053), ('web', 0.052), ('webkb', 0.051), ('authoritative', 0.048), ('cora', 0.048), ('eaf', 0.048), ('inlink', 0.048), ('inlinks', 0.048), ('ryl', 0.048), ('spammer', 0.048), ('columns', 0.046), ('uv', 0.045), ('identities', 0.043), ('plsa', 0.042), ('authority', 0.042), ('joint', 0.041), ('cohn', 0.041), ('di', 0.038), ('matrix', 0.036), ('links', 0.035), ('representational', 0.034), ('factor', 0.033), ('linked', 0.033), ('topics', 0.032), ('billions', 0.032), ('citation', 0.032), ('deepak', 0.032), ('fyl', 0.032), ('iat', 0.032), ('legitimate', 0.032), ('outlinks', 0.032), ('stockholm', 0.032), ('uia', 0.032), ('propagate', 0.032), ('semantic', 0.032), ('page', 0.031), ('variations', 0.029), ('surprising', 0.029), ('content', 0.029), ('contents', 0.028), ('parkway', 0.028), ('amphitheatre', 0.028), ('pji', 0.028), ('weighting', 0.026), ('smoothed', 0.024), ('richardson', 0.024), ('lij', 0.024), ('relational', 0.024), ('thomas', 0.023), ('ul', 0.022), ('aggregates', 0.022), ('text', 0.022), ('topic', 0.022), ('representing', 0.021), ('propagated', 0.02), ('mountain', 0.02), ('equation', 0.02), ('prototype', 0.02), ('mixing', 0.019), ('aggregate', 0.019), ('papers', 0.019), ('cluster', 0.018), ('lot', 0.018), ('ordinary', 0.018), ('term', 0.018), ('david', 0.018), ('interpreted', 0.018), ('terms', 0.018), ('column', 0.018), ('decomposes', 0.018), ('collections', 0.018), ('daniel', 0.017), ('pj', 0.017), ('latent', 0.017), ('representation', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 166 nips-2006-Recursive Attribute Factoring
Author: Deepak Verma, Karl Pfleger, David Tax
Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpora at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm.
2 0.13655509 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the specific word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. 
The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo. Word-based retrieval techniques, such as the widely-used term-frequency inverse-documentfrequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and specificity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstraction and specific word matching. Daum´ and Marcu (2006) independently proposed a probabilistic e model using similar concepts for handling different training and test distributions in classification problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of specific individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. 
Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and concluding comments. 2 A Topic Model for Special Words Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topicword multinomials represented by φ. In the generative model, for each document d, the Nd words (a) β2 β1 α γ Ω ψ θ λ β0 z x φ w (b) α θ β z φ w Nd T T D Nd D Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right). are generated by drawing a topic t from the document-topic distribution p(z|θd ) and then drawing a word w from the topic-word distribution p(w|z = t, φt ). As shown in Griffiths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efficiently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples. We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution specific for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-specific multinomial Ψ or a corpus specific multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2 ) respectively. x is sampled from a document-specific multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors. The conditional probability of a word w given a document d can be written as: T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w) p(w|d) = p(x = 0|d) t=1 ′ where p (w|d) is the special word distribution for document d, and p′′ (w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways, via topics, via a special word distribution, or via a background word distribution. 
Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi , for xi = 0: p (xi = 0, zi = t |w, x−i , z−i , α, β0 , γ ) ∝ Nd0,−i + γ × Nd,−i + 3γ TD Ctd,−i + α t′ TD Ct′ d,−i + T α × WT Cwt,−i + β0 WT w ′ Cw ′ t,−i + W β0 and for xi = 1: p (xi = 1 |w, x−i , z−i , β1 , γ ) ∝ Nd1,−i + γ × Nd,−i + 3γ WD Cwd,−i + β1 w′ WD Cw′ d,−i + W β1 e mail krugman nytimes com memo to critics of the media s liberal bias the pinkos you really should be going after are those business reporters even i was startled by the tone of the jan 21 issue of investment news which describes itself as the weekly newspaper for financial advisers the headline was paul o neill s sweet deal the blurb was irs backs off closing loophole averting tax liability for execs and treasury chief it s not really news that the bush administration likes tax breaks for businessmen but two weeks later i learned from the wall street journal that this loophole is more than a tax break for businessmen it s a gift to biznesmen and it may be part of a larger pattern confused in the former soviet union the term biznesmen pronounced beeznessmen refers to the class of sudden new rich who emerged after the fall of communism and who generally got rich by using their connections to strip away the assets of public enterprises what we ve learned from enron and other players to be named later is that america has its own biznesmen and that we need to watch out for policies that make it easier for them to ply their trade it turns out that the sweet deal investment news was referring to the use of split premium life insurance policies to give executives largely tax free compensation you don t want to know the details is an even sweeter deal for executives of companies that go belly up it shields their wealth from creditors and even from lawsuits sure enough reports the wall street journal former enron c e o s kenneth lay and jeffrey skilling both had large split premium policies so what other pro biznes policies have been promulgated lately last year both houses of … john w snow was paid more than 50 million in salary bonus and stock in his nearly 12 years as chairman of the csx corporation the railroad company during that period the company s profits fell and its stock rose a bit more than half as much as that of the average big company mr snow s compensation amid csx s uneven performance has drawn criticism from union officials and some corporate governance specialists in 2000 for example after the stock had plunged csx decided to reverse a 25 million loan to him the move is likely to get more scrutiny after yesterday s announcement that mr snow has been chosen by president bush to replace paul o neill as the treasury secretary like mr o neill mr snow is an outsider on wall street but an insider in corporate america with long experience running an industrial company some wall street analysts who follow csx said yesterday that mr snow had ably led the company through a difficult period in the railroad industry and would make a good treasury secretary it s an excellent nomination said jill evans an analyst at j p morgan who has a neutral rating on csx stock i think john s a great person for the administration he as the c e o of a railroad has probably touched every sector of the economy union officials are less complimentary of mr snow s performance at csx last year the a f l c i o criticized him and csx for the company s decision to 
reverse the loan allowing him to return stock he had purchased with the borrowed money at a time when independent directors are in demand a corporate governance specialist said recently that mr snow had more business relationships with members of his own board than any other chief executive in addition mr snow is the third highest paid of 37 chief executives of transportation companies said ric marshall chief executive of the corporate library which provides specialized investment research into corporate boards his own compensation levels have been pretty high mr marshall said he could afford to take a public service job a csx program in 1996 allowed mr snow and other top csx executives to buy… Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation. and for xi = 2: p (xi = 2 |w, x−i , z−i , β2 , γ ) ∝ Nd2,−i + γ × Nd,−i + 3γ W Cw,−i + β2 W w ′ Cw ′ ,−i + W β2 where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0 , Nd1 and Nd2 are the number of words in document d assigned to the W W W latent topics, special words and background component, respectively, CwtT , CwdD and Cw are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2 increases. One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufficient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize. We also pursued a variant of SWB, the special words (SW) model that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-specific Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the final word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z. 3 Illustrative Examples We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word “Enron.” The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis. 
Figure 2(a) shows how intentionally misspelled words such as “biznesmen” and “beeznessmen” and rare Collection NIPS PATENTS AP FR # of Docs 1740 6711 10000 2500 Total # of Word Tokens 2,301,375 15,014,099 2,426,181 6,332,681 Median Doc Length 1310 1858 235.5 516 Mean Doc Length 1322.6 2237.2 242.6 2533.1 # of Queries N/A N/A 142 30 Table 1: General characteristics of document data sets used in experiments. NIPS set number results case problem function values paper approach large PATENTS .0206 .0167 .0153 .0123 .0118 .0108 .0102 .0088 .0080 .0079 fig end extend invent view shown claim side posit form .0647 .0372 .0267 .0246 .0214 .0191 .0189 .0177 .0153 .0128 AP tagnum itag requir includ section determin part inform addit applic FR .0416 .0412 .0381 .0207 .0189 .0134 .0112 .0105 .0096 .0086 nation sai presid polici issu call support need govern effort .0147 .0129 .0118 .0108 .0096 .0094 .0085 .0079 .0070 .0068 Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora. words such as “pinkos” are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as “Snow” and the corporation name “CSX” that are specific to the document are likely to be assigned to the special topic. The words “Snow” and “CSX” do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as “special” words. 4 Experimental Results: Perplexity and Precision We use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online1 and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included first and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were fit using T = 200 topics. Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus. The ratio of words assigned to the special words distribution and the background distribution are (respectively for each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short “abstract-like” collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more specific details. 4.1 Perplexity Comparisons The NIPS and PATENTS documents sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents. 
Perplexity of a test set under a model is defined as follows: 1 From http://www.cs.toronto.edu/˜roweis/data.html 1900 550 LDA SW SWB 1800 500 1700 450 LDA SW SWB Perplexity Perplexity 1600 1500 1400 400 350 300 1300 250 1200 200 1100 1000 10 20 30 40 50 60 70 80 150 10 90 20 Percentage of Words Observed 30 40 50 60 70 80 90 Percentage of Words Observed Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right). Perplexity(wtest |Dtrain ) = exp − Dtest d=1 log p(wd |Dtrain ) Dtest d=1 Nd where wtest is a vector of words in the test data set, wd is a vector of words in document d of the test set, and Dtrain is the training set. For the SWB model, we approximate p(wd |Dtrain ) as follows: p(wd |Dtrain ) ≈ 1 S S p(wd |{Θs Φs Ψs Ωs λs }) s=1 where Θs , Φs , Ψs , Ωs and λs are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words wd in a test document d, given its parameters, can be computed as follows: Nd p(wd |{Θs Φs Ψs Ωs λs }) = T s s φs i t θtd + λs ψwi d + λs Ωs i 3d w w 2d λs 1d i=1 t=1 where Nd is the number of words in test document d and wi is the ith word being predicted in the s s test document. θtd , φs i t , ψwi d , Ωs i and λs are point estimates from sample s. w w d When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-specific parameters, θd , ψd and λd and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents. From the perplexity figures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have significantly lower perplexity values than LDA indicating that the SW and SWB models are using the special words “route” to better learn predictive models for individual documents. 4.2 Information Retrieval Results Returning to the point of capturing both specific and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected lowfrequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how Method TF-IDF LSI LDA SW SWB 1 Ret Doc 100.0 97.6 90.0 99.2 99.4 10 Ret Docs 100.0 82.7 80.6 97.1 96.6 50 Ret Docs 100.0 64.6 67.0 79.1 78.7 100 Ret Docs 100.0 54.3 58.7 67.3 67.2 Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus). 
AP MAP Method TF-IDF LSI LDA SW SWB Title .353 .286 .424 .466* .460* Pr@10d Desc .358 .387 .394 .430* .417 Concepts .498 .459 .498 .550* .549* Method TF-IDF LSI LDA SW SWB Title .406 .455 .478 .524* .513* Method TF-IDF LSI LDA SW SWB Title .300 .366 .428 .469 .462 Desc .434 .469 .463 .509* .495 Concepts .549 .523 .556 .599* .603* FR MAP Method TF-IDF LSI LDA SW SWB Title .268 .329 .344 .371 .373 Pr@10d Desc .272 .295 .271 .323* .328* Concepts .391 .399 .396 .448* .435 Desc .287 .327 .340 .407* .423* Concepts .483 .487 .487 .550* .523 *=sig difference wrt LDA Figure 5: Information retrieval experimental results. often specific query words occur in retrieved documents. TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more specific matches than either LDA or LSI, indicating that they have the ability to match at the level of specific words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next. For the AP and FR documents sets, 3 types of query sets were constructed from TREC Topics 1150, based on the T itle (short), Desc (sentence-length) and Concepts (long list of keywords) fields. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w)] [p(x = 0|d) p(q|d) ≈ w∈q t=1 We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TFCW D D wd IDF score for a word w in a document d is computed as TF-IDF(w, d) = Nd × log2 Dw . For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is first mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries. To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric. The maximum score for each query experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical significance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not significant. In figure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically significant difference (always an improvement) relative to the LDA model. The differences for the “non-starred” query and metric scores of SW and SWB are not statistically significant but nonetheless always favor SW and SWB over LDA. 
5 Discussion and Conclusions Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches. In conclusion, we have proposed a new probabilistic model that accounts for both general and specific aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account specific aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner specific to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very specific query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize. Acknowledgements We thank Tom Griffiths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics. References Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation, Journal of Machine Learning Research 3: 993-1022. Buntine, W., L¨ fstr¨ m, J., Perttu, S. and Valtonen, K. (2005) Topic-specific scoring of documents for relevant o o retrieval Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany. Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129. Daum´ III, H., and Marcu, D. (2006) Domain Adaptation for Statistical classifiers. Journal of the Artificial e Intelligence Research, 26: 101-126. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407. Griffiths, T. L., and Steyvers, M. (2004) Finding scientific topics, Proceedings of the National Academy of Sciences, pp. 5228-5235. Hofmann, T. (1999) Probabilistic latent semantic indexing, Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57. Vogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151173. Wei, X. and Croft, W.B. (2006) LDA-based document models for ad-hoc retrieval, Proceedings of the 29th SIGIR Conference, pp. 178-185.
3 0.096028723 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. 1
4 0.088655733 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
Author: Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, Zhao Xu
Abstract: We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existences, strengths, or types of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks. 1
5 0.086496033 5 nips-2006-A Kernel Method for the Two-Sample-Problem
6 0.068666741 66 nips-2006-Detecting Humans via Their Pose
7 0.060855389 169 nips-2006-Relational Learning with Gaussian Processes
8 0.049619418 39 nips-2006-Balanced Graph Matching
9 0.048672128 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
10 0.046450164 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
11 0.044657703 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
12 0.043156147 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
13 0.03880889 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
14 0.038222313 7 nips-2006-A Local Learning Approach for Clustering
15 0.036368739 130 nips-2006-Max-margin classification of incomplete data
16 0.036283802 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
17 0.034736857 153 nips-2006-Online Clustering of Moving Hyperplanes
18 0.03390919 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
19 0.033321217 139 nips-2006-Multi-dynamic Bayesian Networks
20 0.032774959 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
topicId topicWeight
[(0, -0.108), (1, 0.029), (2, 0.033), (3, -0.018), (4, 0.039), (5, -0.006), (6, 0.047), (7, 0.028), (8, -0.006), (9, -0.012), (10, -0.071), (11, 0.076), (12, 0.026), (13, -0.023), (14, -0.016), (15, -0.011), (16, -0.078), (17, -0.054), (18, 0.054), (19, -0.09), (20, -0.103), (21, -0.17), (22, 0.137), (23, 0.154), (24, -0.055), (25, -0.085), (26, -0.164), (27, -0.125), (28, 0.183), (29, -0.049), (30, 0.002), (31, -0.242), (32, -0.038), (33, 0.026), (34, 0.108), (35, -0.007), (36, 0.035), (37, 0.016), (38, 0.131), (39, 0.053), (40, 0.008), (41, -0.013), (42, 0.078), (43, 0.029), (44, 0.154), (45, 0.079), (46, -0.08), (47, 0.091), (48, -0.045), (49, -0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.97532219 166 nips-2006-Recursive Attribute Factoring
Author: Deepak Verma, Karl Pfleger, David Tax
Abstract: Clustering, or factoring of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpus at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm. 1
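Since the abstract notes that some variants of the model can be seen as exact generalizations of PageRank, a minimal reference sketch of the standard PageRank power iteration is included below for comparison. The damping factor, tolerance, and dangling-node handling are conventional choices assumed for the sketch, not details taken from the paper.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank on an adjacency matrix adj[i, j] = 1 if page i links to page j."""
    n = adj.shape[0]
    # Row-normalize; pages with no out-links spread their mass uniformly.
    out_deg = adj.sum(axis=1, keepdims=True)
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * rank @ trans
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Tiny example: 3 pages with links 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(pagerank(adj))
```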
2 0.72922248 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the specific word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. 
The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo. Word-based retrieval techniques, such as the widely-used term-frequency inverse-documentfrequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and specificity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstraction and specific word matching. Daum´ and Marcu (2006) independently proposed a probabilistic e model using similar concepts for handling different training and test distributions in classification problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of specific individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. 
Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and concluding comments.

2 A Topic Model for Special Words
Figure 1(a) shows the graphical model for what we will refer to as the "standard topic model" or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topic-word multinomials represented by φ.

[Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right).]

In the generative model, for each document d, the Nd words are generated by drawing a topic t from the document-topic distribution p(z|θd) and then drawing a word w from the topic-word distribution p(w|z = t, φt). As shown in Griffiths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efficiently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples. We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution specific for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-specific multinomial Ψ or a corpus-specific multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2) respectively. x is sampled from a document-specific multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors. The conditional probability of a word w given a document d can be written as:

$$ p(w \mid d) = p(x{=}0 \mid d) \sum_{t=1}^{T} p(w \mid z{=}t)\, p(z{=}t \mid d) \;+\; p(x{=}1 \mid d)\, p'(w \mid d) \;+\; p(x{=}2 \mid d)\, p''(w) $$

where p'(w|d) is the special word distribution for document d, and p''(w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways, via topics, via a special word distribution, or via a background word distribution.
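To make the three-route mixture above concrete, here is a minimal sketch that evaluates p(w|d) from point estimates of the SWB parameters. The array names and shapes are illustrative assumptions for the sketch, not notation taken from the paper.

```python
import numpy as np

def swb_word_prob(w, d, phi, theta, psi, omega, lam):
    """p(w | d) under SWB from point estimates (illustrative sketch).

    phi:   (T, W) topic-word probabilities, phi[t, v] = p(v | z = t)
    theta: (D, T) document-topic probabilities, theta[d, t] = p(z = t | d)
    psi:   (D, W) special-word distributions, psi[d, v] = p'(v | d)
    omega: (W,)   background distribution, omega[v] = p''(v)
    lam:   (D, 3) route probabilities p(x = 0 | d), p(x = 1 | d), p(x = 2 | d)
    """
    topic_route = lam[d, 0] * np.dot(theta[d], phi[:, w])   # sum_t p(w|t) p(t|d)
    special_route = lam[d, 1] * psi[d, w]
    background_route = lam[d, 2] * omega[w]
    return topic_route + special_route + background_route
```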
Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi. For xi = 0:

$$ p(x_i = 0, z_i = t \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \alpha, \beta_0, \gamma) \;\propto\; \frac{N_{d0,-i} + \gamma}{N_{d,-i} + 3\gamma} \times \frac{C^{TD}_{td,-i} + \alpha}{\sum_{t'} C^{TD}_{t'd,-i} + T\alpha} \times \frac{C^{WT}_{wt,-i} + \beta_0}{\sum_{w'} C^{WT}_{w't,-i} + W\beta_0} $$

and for xi = 1:

$$ p(x_i = 1 \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \beta_1, \gamma) \;\propto\; \frac{N_{d1,-i} + \gamma}{N_{d,-i} + 3\gamma} \times \frac{C^{WD}_{wd,-i} + \beta_1}{\sum_{w'} C^{WD}_{w'd,-i} + W\beta_1} $$

[Figure 2, showing two annotated example news articles with special words shaded, appears here; only its caption is retained below.]
[Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation.]

and for xi = 2:

$$ p(x_i = 2 \mid \mathbf{w}, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \beta_2, \gamma) \;\propto\; \frac{N_{d2,-i} + \gamma}{N_{d,-i} + 3\gamma} \times \frac{C^{W}_{w,-i} + \beta_2}{\sum_{w'} C^{W}_{w',-i} + W\beta_2} $$

where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0, Nd1 and Nd2 are the number of words in document d assigned to the latent topics, special words and background component, respectively, C^{WT}_{wt}, C^{WD}_{wd} and C^{W}_{w} are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2, increases. One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufficient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize. We also pursued a variant of SWB, the special words (SW) model, that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-specific Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the final word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z.

3 Illustrative Examples
We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word "Enron." The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis.
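Returning to the collapsed Gibbs updates given above, the sketch below resamples the route x and (when applicable) topic z for a single word token from the three conditionals. The count-array layout, variable names, and bookkeeping details are assumptions made for this sketch; the constant denominator N_{d,-i} + 3γ is dropped because the probabilities are renormalized.

```python
import numpy as np

def resample_token(w, d, x_old, z_old, n_dx, c_td, c_wt, c_wd, c_w,
                   alpha, beta0, beta1, beta2, gamma, rng):
    """One collapsed Gibbs update for a single token (illustrative sketch).

    n_dx: (D, 3) tokens in doc d assigned to each route x
    c_td: (T, D) topic-document counts; c_wt: (W, T) word-topic counts
    c_wd: (W, D) special-word counts;   c_w:  (W,)  background counts
    """
    W, T = c_wt.shape
    # Remove the token's current assignment from the counts.
    n_dx[d, x_old] -= 1
    if x_old == 0:
        c_td[z_old, d] -= 1; c_wt[w, z_old] -= 1
    elif x_old == 1:
        c_wd[w, d] -= 1
    else:
        c_w[w] -= 1
    # Unnormalized probabilities: T topic outcomes, then special, then background.
    route = n_dx[d] + gamma
    p_topic = route[0] * (c_td[:, d] + alpha) / (c_td[:, d].sum() + T * alpha) \
                       * (c_wt[w] + beta0) / (c_wt.sum(axis=0) + W * beta0)
    p_special = route[1] * (c_wd[w, d] + beta1) / (c_wd[:, d].sum() + W * beta1)
    p_background = route[2] * (c_w[w] + beta2) / (c_w.sum() + W * beta2)
    probs = np.concatenate([p_topic, [p_special, p_background]])
    choice = rng.choice(T + 2, p=probs / probs.sum())
    # Record the new assignment and restore the counts.
    if choice < T:
        x_new, z_new = 0, choice
        c_td[z_new, d] += 1; c_wt[w, z_new] += 1
    elif choice == T:
        x_new, z_new = 1, None
        c_wd[w, d] += 1
    else:
        x_new, z_new = 2, None
        c_w[w] += 1
    n_dx[d, x_new] += 1
    return x_new, z_new
```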
Figure 2(a) shows how intentionally misspelled words such as "biznesmen" and "beeznessmen" and rare words such as "pinkos" are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as "Snow" and the corporation name "CSX" that are specific to the document are likely to be assigned to the special topic. The words "Snow" and "CSX" do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as "special" words.

4 Experimental Results: Perplexity and Precision
We use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online (from http://www.cs.toronto.edu/~roweis/data.html), and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included first and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were fit using T = 200 topics.

Table 1: General characteristics of document data sets used in experiments.
Collection   # of Docs   Total # of Word Tokens   Median Doc Length   Mean Doc Length   # of Queries
NIPS         1740        2,301,375                1310                1322.6            N/A
PATENTS      6711        15,014,099               1858                2237.2            N/A
AP           10000       2,426,181                235.5               242.6             142
FR           2500        6,332,681                516                 2533.1            30

Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus. The ratio of words assigned to the special words distribution and the background distribution are (respectively for each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short "abstract-like" collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more specific details.

[Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora; the per-corpus word lists and probabilities are omitted here.]

4.1 Perplexity Comparisons
The NIPS and PATENTS document sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents.
Perplexity of a test set under a model is defined as follows:

$$ \mathrm{Perplexity}(\mathbf{w}_{\mathrm{test}} \mid D_{\mathrm{train}}) = \exp\left( - \frac{\sum_{d=1}^{D_{\mathrm{test}}} \log p(\mathbf{w}_d \mid D_{\mathrm{train}})}{\sum_{d=1}^{D_{\mathrm{test}}} N_d} \right) $$

where w_test is a vector of words in the test data set, w_d is a vector of words in document d of the test set, and D_train is the training set. For the SWB model, we approximate p(w_d|D_train) as follows:

$$ p(\mathbf{w}_d \mid D_{\mathrm{train}}) \approx \frac{1}{S} \sum_{s=1}^{S} p(\mathbf{w}_d \mid \{\Theta^s, \Phi^s, \Psi^s, \Omega^s, \lambda^s\}) $$

where Θ^s, Φ^s, Ψ^s, Ω^s and λ^s are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words w_d in a test document d, given its parameters, can be computed as follows:

$$ p(\mathbf{w}_d \mid \{\Theta^s, \Phi^s, \Psi^s, \Omega^s, \lambda^s\}) = \prod_{i=1}^{N_d} \left[ \lambda^s_{1d} \sum_{t=1}^{T} \phi^s_{w_i t}\, \theta^s_{td} \;+\; \lambda^s_{2d}\, \psi^s_{w_i d} \;+\; \lambda^s_{3d}\, \Omega^s_{w_i} \right] $$

where Nd is the number of words in test document d and wi is the ith word being predicted in the test document. θ^s_{td}, φ^s_{w_i t}, ψ^s_{w_i d}, Ω^s_{w_i} and λ^s_d are point estimates from sample s. When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-specific parameters, θd, ψd and λd, and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents.

[Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right).]

From the perplexity figures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have significantly lower perplexity values than LDA, indicating that the SW and SWB models are using the special words "route" to better learn predictive models for individual documents.

4.2 Information Retrieval Results
Returning to the point of capturing both specific and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected low-frequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how often specific query words occur in retrieved documents.

Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus).
Method   1 Ret Doc   10 Ret Docs   50 Ret Docs   100 Ret Docs
TF-IDF   100.0       100.0         100.0         100.0
LSI      97.6        82.7          64.6          54.3
LDA      90.0        80.6          67.0          58.7
SW       99.2        97.1          79.1          67.3
SWB      99.4        96.6          78.7          67.2
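Stepping back to the perplexity definition in Section 4.1 above, the sketch below evaluates perplexity over held-out tokens given any per-token probability function, such as the SWB mixture sketched earlier in this entry. The data layout is an assumption for the sketch.

```python
import numpy as np

def perplexity(test_docs, word_prob):
    """Perplexity over held-out tokens (illustrative sketch).

    test_docs: list of (doc_index, list_of_word_indices) pairs
    word_prob: callable word_prob(d, w) returning p(w | d) under the fitted model
    """
    total_log_prob, total_tokens = 0.0, 0
    for d, words in test_docs:
        for w in words:
            total_log_prob += np.log(word_prob(d, w))
            total_tokens += 1
    return np.exp(-total_log_prob / total_tokens)
```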
[Figure 5: Information retrieval experimental results: mean average precision (MAP) and precision at 10 retrieved documents (Pr@10d) on the AP and FR collections, broken down by Title, Desc, and Concepts queries, for TF-IDF, LSI, LDA, SW, and SWB; * = significant difference with respect to LDA.]

TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more specific matches than either LDA or LSI, indicating that they have the ability to match at the level of specific words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next. For the AP and FR document sets, 3 types of query sets were constructed from TREC Topics 1-150, based on the Title (short), Desc (sentence-length) and Concepts (long list of keywords) fields. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have

$$ p(q \mid d) \approx \prod_{w \in q} \left[ p(x{=}0 \mid d) \sum_{t=1}^{T} p(w \mid z{=}t)\, p(z{=}t \mid d) + p(x{=}1 \mid d)\, p'(w \mid d) + p(x{=}2 \mid d)\, p''(w) \right] $$

We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TF-IDF score for a word w in a document d is computed as

$$ \mathrm{TF\text{-}IDF}(w, d) = \frac{C^{WD}_{wd}}{N_d} \times \log_2 \frac{D}{D_w}. $$

For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is first mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries. To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric. The maximum score for each query experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical significance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not significant. In Figure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically significant difference (always an improvement) relative to the LDA model. The differences for the "non-starred" query and metric scores of SW and SWB are not statistically significant but nonetheless always favor SW and SWB over LDA.
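The TF-IDF baseline above is easy to make concrete. The sketch below builds TF-IDF weights from a word-document count matrix and ranks documents by the summed weight of the query words; the SWB query-likelihood score is simply the product over query words of the per-word mixture sketched earlier. Matrix layout and function names are assumptions for the sketch.

```python
import numpy as np

def tfidf_matrix(counts):
    """TF-IDF weights from a (W, D) count matrix, counts[w, d] = C_wd."""
    doc_len = counts.sum(axis=0, keepdims=True)          # N_d per document
    doc_freq = (counts > 0).sum(axis=1, keepdims=True)   # D_w per word
    num_docs = counts.shape[1]
    return (counts / np.maximum(doc_len, 1)) * np.log2(num_docs / np.maximum(doc_freq, 1))

def rank_by_tfidf(query_words, counts):
    """Rank documents by the summed TF-IDF weight of the query words."""
    weights = tfidf_matrix(counts)
    scores = weights[query_words].sum(axis=0)            # one score per document
    return np.argsort(-scores)                           # best-scoring documents first
```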
5 Discussion and Conclusions
Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches. In conclusion, we have proposed a new probabilistic model that accounts for both general and specific aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account specific aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner specific to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very specific query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize.

Acknowledgements
We thank Tom Griffiths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics.

References
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.
Buntine, W., Löfström, J., Perttu, S. and Valtonen, K. (2005) Topic-specific scoring of documents for relevant retrieval. Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany.
Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129.
Daumé III, H., and Marcu, D. (2006) Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26: 101-126.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.
Griffiths, T. L., and Steyvers, M. (2004) Finding scientific topics. Proceedings of the National Academy of Sciences, pp. 5228-5235.
Hofmann, T. (1999) Probabilistic latent semantic indexing. Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57.
Vogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151-173.
Wei, X. and Croft, W.B. (2006) LDA-based document models for ad-hoc retrieval. Proceedings of the 29th SIGIR Conference, pp. 178-185.
3 0.57375395 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. 1
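The abstract above describes LambdaRank in terms of implicit cost functions defined through gradients ("lambdas") on the model scores. As a rough illustration of that idea, the sketch below computes pairwise lambda gradients in which each pair's contribution is scaled by the NDCG change obtained by swapping the two documents; this particular choice of lambda, and the unoptimized quadratic loop, are assumptions for the sketch rather than the paper's exact formulation.

```python
import numpy as np

def dcg(relevances):
    return sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(relevances))

def lambda_gradients(scores, relevances):
    """Pairwise lambda (pseudo-)gradients for one query (illustrative sketch)."""
    scores = np.asarray(scores, dtype=float)
    relevances = np.asarray(relevances, dtype=float)
    order = np.argsort(-scores)                  # current ranking by score
    ranked_rel = relevances[order]
    rank_of = np.empty_like(order)
    rank_of[order] = np.arange(len(order))
    ideal = dcg(sorted(relevances, reverse=True)) or 1.0
    lambdas = np.zeros_like(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevances[i] <= relevances[j]:
                continue
            # NDCG change if documents i and j swapped ranking positions.
            swapped = ranked_rel.copy()
            swapped[rank_of[i]], swapped[rank_of[j]] = ranked_rel[rank_of[j]], ranked_rel[rank_of[i]]
            delta_ndcg = abs(dcg(swapped) - dcg(ranked_rel)) / ideal
            lam = -delta_ndcg / (1.0 + np.exp(scores[i] - scores[j]))
            lambdas[i] += lam
            lambdas[j] -= lam
    return lambdas

# A gradient step on the scores, scores -= eta * lambda_gradients(scores, rels),
# pushes more relevant documents toward the top of the ranking.
```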
4 0.41648901 5 nips-2006-A Kernel Method for the Two-Sample-Problem
Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola
Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
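The O(m^2) statistic described above is straightforward to compute. Below is a minimal sketch of the biased MMD^2 estimate with a Gaussian RBF kernel; the kernel choice, bandwidth, and toy data are assumptions made for illustration, and the paper's acceptance thresholds (from the large-deviation bound or the asymptotic null distribution) are not reproduced here.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of MMD^2 between samples x (m, d) and y (n, d)."""
    kxx, kyy, kxy = rbf_kernel(x, x, sigma), rbf_kernel(y, y, sigma), rbf_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

# Example: samples from two Gaussians with different means
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 5))
y = rng.normal(0.5, 1.0, size=(200, 5))
print(mmd2_biased(x, y))
```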
5 0.41229859 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
Author: Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, Zhao Xu
Abstract: We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existences, strengths, or types of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks. 1
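The abstract above builds link covariances from a tensor interaction of per-entity GP kernels. The sketch below illustrates that idea in its simplest form: the covariance between links (i, j) and (k, l) is taken as the product of a kernel on the row entities and a kernel on the column entities, and a new link is predicted by GP regression on the observed links. The RBF kernels, noise level, and plain regression (in place of the paper's full learning algorithm) are assumptions made for illustration.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def predict_link(u_feats, v_feats, observed, new_pair, noise=0.1):
    """GP-regression prediction for one link under a tensor (product) kernel.

    u_feats: (U, du) attributes of row entities; v_feats: (V, dv) of column entities
    observed: list of ((i, j), value) pairs of observed links; new_pair: (i, j) to predict
    """
    ku, kv = rbf(u_feats, u_feats), rbf(v_feats, v_feats)
    pairs = [p for p, _ in observed]
    y = np.array([v for _, v in observed])
    # Gram matrix over observed links: K[(i,j),(k,l)] = ku[i,k] * kv[j,l]
    gram = np.array([[ku[i, k] * kv[j, l] for (k, l) in pairs] for (i, j) in pairs])
    k_star = np.array([ku[new_pair[0], k] * kv[new_pair[1], l] for (k, l) in pairs])
    alpha = np.linalg.solve(gram + noise * np.eye(len(pairs)), y)
    return k_star @ alpha
```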
6 0.35877338 169 nips-2006-Relational Learning with Gaussian Processes
7 0.33009547 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes
8 0.28799909 66 nips-2006-Detecting Humans via Their Pose
9 0.26180559 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
10 0.26173505 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
11 0.25992879 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
12 0.25684237 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding
13 0.23110908 139 nips-2006-Multi-dynamic Bayesian Networks
14 0.22779948 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
15 0.21691173 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
16 0.21314853 39 nips-2006-Balanced Graph Matching
17 0.2076066 129 nips-2006-Map-Reduce for Machine Learning on Multicore
18 0.20102906 98 nips-2006-Inferring Network Structure from Co-Occurrences
19 0.19893907 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing
20 0.19516712 53 nips-2006-Combining causal and similarity-based reasoning
topicId topicWeight
[(1, 0.069), (3, 0.021), (7, 0.05), (9, 0.044), (12, 0.014), (20, 0.02), (22, 0.049), (44, 0.077), (57, 0.102), (65, 0.027), (69, 0.02), (71, 0.024), (86, 0.379), (90, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.80615813 166 nips-2006-Recursive Attribute Factoring
2 0.59698653 130 nips-2006-Max-margin classification of incomplete data
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. 1
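One way to picture the geometric idea described above is to measure each instance's margin only in the subspace of its observed features, rather than first imputing the absent ones. The sketch below computes such an instance margin for a given linear classifier; it is an illustration of that intuition under stated assumptions, not the paper's optimization procedure.

```python
import numpy as np

def instance_margin(w, b, x, observed_mask, y):
    """Signed margin of one instance, measured only over its observed features.

    w, x: (F,) weight and feature vectors; observed_mask: (F,) boolean;
    y: label in {-1, +1}. Absent features simply drop out of the inner
    product and of the norm used to scale the margin.
    """
    w_obs = w * observed_mask
    x_obs = np.where(observed_mask, x, 0.0)
    norm = np.linalg.norm(w_obs) or 1.0
    return y * (np.dot(w_obs, x_obs) + b) / norm
```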
3 0.39913091 5 nips-2006-A Kernel Method for the Two-Sample-Problem
4 0.39649701 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf
Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1
5 0.39440402 34 nips-2006-Approximate Correspondences in High Dimensions
Author: Kristen Grauman, Trevor Darrell
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel. 1
6 0.39366975 154 nips-2006-Optimal Change-Detection and Spiking Neurons
7 0.39305821 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.39229986 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
9 0.3911317 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
10 0.39110562 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
11 0.38928989 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
12 0.38828966 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
13 0.38683814 42 nips-2006-Bayesian Image Super-resolution, Continued
14 0.38566723 167 nips-2006-Recursive ICA
15 0.38520068 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
16 0.38484731 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
17 0.38445041 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
18 0.38420144 158 nips-2006-PG-means: learning the number of clusters in data
19 0.38396421 65 nips-2006-Denoising and Dimension Reduction in Feature Space
20 0.38202018 43 nips-2006-Bayesian Model Scoring in Markov Random Fields