nips nips2006 nips2006-101 knowledge-graph by maker-knowledge-mining

101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow


Source: pdf

Author: Yi Mao, Guy Lebanon

Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is well-suited to handle this problem. Using the Möbius transform, we express the model as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. [sent-7, score-1.009]

2 Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. [sent-8, score-0.183]

3 In the spirit of isotonic regression, we develop a variant of conditional random fields that is well-suited to handle this problem. [sent-9, score-0.215]

4 Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization. [sent-11, score-0.95]

5 Many documents, such as reviews and blogs, are written with the purpose of conveying a particular opinion or sentiment. [sent-13, score-0.093]

6 Other documents may not be written with the purpose of conveying an opinion, but nevertheless they contain one. [sent-14, score-0.07]

7 Predicting the document’s sentiment would allow matching the sentiment, as well as the topic, with the user’s interests. [sent-17, score-0.907]

8 It would also assist in document summarization and visualization. [sent-18, score-0.105]

9 [1] demonstrated the difficulties of sentiment prediction using empirical rules alone (a subset of adjectives), which motivates the use of statistical learning techniques. [sent-21, score-0.956]

10 The task was then refined to allow multiple sentiment levels, facilitating the use of standard text categorization techniques [2]. [sent-22, score-0.942]

11 Indeed, sentiment prediction is a much harder task than topic classification tasks such as Reuters or WebKB and current models achieve lower accuracy. [sent-24, score-0.959]

12 Rather than using a bag of words multiclass classifier, we model the sequential flow of sentiment throughout the document using a sequential conditional model. [sent-25, score-1.06]

13 Furthermore, we treat the sentiment labels as ordinal variables by enforcing monotonicity constraints on the model’s parameters. [sent-26, score-1.071]

14 2 Local and Global Sentiments Previous research on sentiment prediction has generally focused on predicting the sentiment of the entire document. [sent-27, score-1.882]

15 Typically, the problem is considered as standard multiclass classification or regression using the bag of words representation. [sent-29, score-0.076]

16 In addition to the sentiment of the entire document, which we call global sentiment, we define the concept of local sentiment as the sentiment associated with a particular part of the text. [sent-30, score-2.785]

17 It is reasonable to assume that the global sentiment of a document is a function of the local sentiment and that estimating the local sentiment is a key step in predicting the global sentiment. [sent-31, score-2.923]

18 Moreover, the concept of local sentiment is useful in a wide range of text analysis applications including document summarization and visualization. [sent-32, score-1.071]

19 Formally, we view local sentiment as a function on the words in a document taking values in a finite partially ordered set, or a poset, (O, ≤). [sent-33, score-1.046]

20 To determine the local sentiment at a particular word, it is necessary to take context into account. [sent-34, score-0.945]

21 For example, due to context, the local sentiment at each of the words in "this is a horrible product" is low (in the sense of (O, ≤)). [sent-35, score-0.981]

22 Since sentences are natural components for segmenting document semantics, we view local sentiment as a piecewise constant function on sentences. [sent-36, score-1.124]

23 Occasionally we encounter a sentence that violates this rule and conveys opposing sentiments in two different parts. [sent-37, score-0.134]

24 We therefore formalize the problem as predicting a sequence of sentiments $y = (y_1, \ldots, y_n)$, [sent-39, score-0.134]

25 $y_i \in O$, based on a sequence of sentences $x = (x_1, \ldots, x_n)$. [sent-42, score-0.259]

26 Modeling the local sentiment is challenging from several aspects. [sent-46, score-0.945]

27 The sentence sequence x is discrete-time and high-dimensional categorical-valued, and the sentiment sequence y is discrete-time and ordinal-valued. [sent-47, score-1.114]

28 In this paper we demonstrate the prediction of local sentiment flow using an ordinal version of conditional random fields, and explore the relation between the local and global sentiment. [sent-50, score-1.16]

29 CRFs have mostly been applied to sequence annotation, where x is a sequence of words and y is a sequence of labels annotating the words, for example part-of-speech tags. [sent-53, score-0.121]

30 In other words, the cliques are $C = \{\{y_{i-1}, y_i\}, \{y_i, x_i\} : i = 1, \ldots, n\}$ (see Figure 1, left), [sent-55, score-0.152]

31 leading to the model $p_\theta(y|x) = \frac{1}{Z(x,\theta)} \exp\big( \sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i) + \sum_i \sum_k \mu_k g_k(y_i, x_i) \big)$, $\theta = (\lambda, \mu)$. (2) [sent-58, score-0.14]

32 In sequence annotation a standard choice for the feature functions is $f^{\sigma,\tau}(y_{i-1}, y_i) = \delta_{y_{i-1},\sigma}\,\delta_{y_i,\tau}$ and $g^{\sigma,w}(y_i, x_i) = \delta_{y_i,\sigma}\,\delta_{x_i,w}$ (note that we index the feature functions using pairs rather than k as in (2)). [sent-59, score-0.176]

33 In our case, since the $x_i$ are sentences, we use instead the slightly modified feature functions $g^{\sigma,w}(y_i, x_i) = 1$ if $y_i = \sigma$ and $w \in x_i$, and 0 otherwise. [sent-60, score-0.312]
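
As an illustration of these feature functions (not the authors' Mallet-based implementation), the sketch below computes the unnormalized log-score of a label sequence under the clique structure of (2); the dictionary-based parameter layout and the toy labels are assumptions made only for the example.

```python
from collections import defaultdict

def log_score(y, x, lam, mu):
    """Unnormalized log-score of label sequence y for sentence sequence x,
    following the clique structure of model (2)."""
    s = 0.0
    # transition features f^{sigma,tau}(y_{i-1}, y_i)
    for i in range(1, len(y)):
        s += lam.get((y[i - 1], y[i]), 0.0)
    # sentence-level emission features g^{sigma,w}(y_i, x_i) = 1 iff y_i = sigma and w in x_i
    for yi, xi in zip(y, x):
        for w in xi:
            s += mu.get((yi, w), 0.0)
    return s

# toy usage with hypothetical labels in O = {-2, -1, 0, 1, 2}
x = [{"this", "is", "a", "horrible", "product"}, {"but", "shipping", "was", "fast"}]
y = [-2, 1]
print(log_score(y, x, lam=defaultdict(float), mu={(-2, "horrible"): 1.5}))
```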

34 Despite the great popularity of CRF in sequence labeling, they are not appropriate for ordinal data such as sentiments. [sent-62, score-0.114]

35 The ordinal relation is ignored in (2), and in the case of limited training data the parameter estimates will possess high variance resulting in poor predictive power. [sent-63, score-0.106]

36 We therefore enforce a set of monotonicity constraints on the parameters that are consistent with the ordinal structure and domain knowledge. [sent-64, score-0.149]

37 The resulting model is a restricted subset of the CRF (2) and, in accordance with isotonic regression [4], is named isotonic CRF. [sent-65, score-0.4]

38 Since ordinal variables express a progression of some sort, it is natural to expect some of the binary features in (2) to correlate more strongly with some ordinal values than others. [sent-66, score-0.194]

39 In such cases, we should expect the presence of such binary features to increase (or decrease) the conditional probability in a manner consistent with the ordinal relation. [sent-67, score-0.113]

40 Let p(y|x) be a linear state-emission chain CRF with binary features f σ,τ , g σ,w as above, and x a sentence sequence for which v ∈ xj . [sent-76, score-0.074]

41 Since $\mu^{t_j,v} \ge \mu^{s_j,v}$ we have that $e^{z-\mu^{s_j,v}} - e^{z-\mu^{t_j,v}} \ge 0$ for all $z$, and hence $E_{p(y|x)}\big[e^{\mu^{y_j,v}-\mu^{s_j,v}} - e^{\mu^{y_j,v}-\mu^{t_j,v}}\big] \ge 0$. By Proposition 1 the above expectation equals $\frac{p(s|x)}{p(s|x')} - \frac{p(t|x)}{p(t|x')}$, which is therefore $\ge 0$. [sent-93, score-0.208]

42 Conceptually, the parameter estimates for isotonic CRF may be found by maximizing the likelihood or posterior subject to the monotonicity constraints (3)-(4). [sent-99, score-0.233]
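
The paper fits the isotonic CRF with a modified quasi-Newton procedure in Mallet, so the following is only a sketch of the constraint-handling idea: after an unconstrained update, the emission weights for a word under a positivity constraint can be projected back onto sequences that are non-decreasing in the sentiment label, using the pool-adjacent-violators algorithm from isotonic regression. The dictionary layout and the projection-based scheme are assumptions; constraints (3)-(4) themselves are not reproduced in this summary.

```python
def pav_increasing(v):
    """Pool-adjacent-violators: project a vector onto non-decreasing sequences (unweighted L2)."""
    merged = []  # list of [block mean, block size]
    for x in v:
        merged.append([x, 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, n2 = merged.pop()
            m1, n1 = merged.pop()
            merged.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    out = []
    for m, n in merged:
        out.extend([m] * n)
    return out

def project_word_weights(mu, word, ordered_labels):
    """Force mu[(sigma, word)] to be non-decreasing in sigma for a 'positivity' word."""
    vals = [mu.get((s, word), 0.0) for s in ordered_labels]
    for s, v in zip(ordered_labels, pav_increasing(vals)):
        mu[(s, word)] = v
    return mu

mu = {(-2, "great"): 0.4, (-1, "great"): 0.1, (0, "great"): 0.2, (1, "great"): 0.5, (2, "great"): 0.3}
print(project_word_weights(mu, "great", [-2, -1, 0, 1, 2]))
```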

43 The word emissions yi → xi in the CRF structure are not expected to vary much across different authors. [sent-111, score-0.157]

44 The sentiment transitions yi−1 → yi , on the other hand, typically vary across different authors as a consequence of their individual styles. [sent-112, score-1.032]

45 For example, the review of an author who sticks to a list of self-ranked evaluation criteria is prone to strong sentiment variations. [sent-113, score-0.987]

46 In contrast, the review of an author who likes to enumerate pros before he gets to cons (or vice versa) is likely to exhibit more local homogeneity in sentiment. [sent-114, score-0.106]

47 Accounting for author-specific sentiment transition style leads to the graphical model in Figure 1 right. [sent-115, score-0.94]

48 However, it is straightforward to combine author-specific models with isotonic restrictions. [sent-120, score-0.194]

49 Experiments demonstrating author-specific isotonic models are described in Section 4. [sent-121, score-0.194]

50 2 Sentiment Flows as Smooth Curves The sentence-based definition of sentiment flow is problematic when we want to fit a model (for example to predict global sentiment) that uses sentiment flows from multiple documents. [sent-125, score-1.84]

51 Different documents have different numbers of sentences, and it is not clear how to compare them or how to build a model from a collection of discrete flows of different lengths. [sent-126, score-0.151]

52 We assume from now on that the ordinal set O is realized as a subset of R and that its ordering coincides with the standard ordering on R. [sent-128, score-0.13]

53 In order to account for different lengths, we consider the sentiment flow as a function $h : [0,1] \to O \subset \mathbb{R}$ that is piecewise constant on the intervals $[0, l), [l, 2l), \ldots,$ [sent-129, score-0.92]

54 $[(k-1)l, 1]$, where $k$ is the number of sentences in the document and $l = 1/k$. [sent-132, score-0.166]

55 The resulting sentiment flow is a smooth curve f : [0, 1] → R that can be easily related or compared to similar sentiment flows of other documents (see Figure 3 for an example). [sent-135, score-1.878]
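
A minimal sketch of the length normalization described above, plus one possible smoother; the paper does not spell out its smoothing procedure in this summary, so the Gaussian kernel smoother and the grid size below are assumptions.

```python
import numpy as np

def piecewise_flow(labels, grid_size=200):
    """Sample h : [0,1] -> O, constant on [(i-1)l, il) with l = 1/k (k = number of sentences)."""
    k = len(labels)
    r = np.linspace(0.0, 1.0, grid_size, endpoint=False)
    idx = np.minimum((r * k).astype(int), k - 1)
    return r, np.asarray(labels, dtype=float)[idx]

def smooth_flow(h, bandwidth=0.05):
    """Gaussian-kernel smoothing of a sampled flow (one possible smoother, not the paper's)."""
    n = len(h)
    r = np.linspace(0.0, 1.0, n, endpoint=False)
    w = np.exp(-0.5 * ((r[:, None] - r[None, :]) / bandwidth) ** 2)
    return (w @ h) / w.sum(axis=1)

r, h = piecewise_flow([0, 1, 2, 1, -1, -2, 0])
f = smooth_flow(h)
```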

56 We can then define natural distances between two flows, for example the $L_p$ distance $d_p(f_1, f_2) = \big( \int_0^1 |f_1(r) - f_2(r)|^p \, dr \big)^{1/p}$ (7), for use in a k-nearest neighbor model for relating the local sentiment flow to the global sentiment. [sent-136, score-0.99]
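
A sketch of the distance in (7), approximated on a shared sampling grid, together with a k-nearest-neighbor predictor of global sentiment built on it; the value of k, the majority vote, and the grid approximation are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def flow_distance(f1, f2, p=2):
    """Approximate d_p(f1, f2) = (int_0^1 |f1 - f2|^p dr)^(1/p) on a shared uniform grid."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    return np.mean(np.abs(f1 - f2) ** p) ** (1.0 / p)

def knn_global_sentiment(query_flow, train_flows, train_labels, k=5, p=2):
    """Predict the global sentiment of a document from its smoothed local sentiment flow."""
    d = [flow_distance(query_flow, f, p) for f in train_flows]
    nearest = np.argsort(d)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```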

57 4 Experiments To examine the ideas proposed in this paper we implemented isotonic CRF, and the normalization and smoothing procedure, and experimented with a small dataset of 249 movie reviews, randomly selected from the Cornell sentence polarity dataset v1. [sent-137, score-0.319]

58 The code for isotonic CRF is a modified version of the quasi-Newton implementation in the Mallet toolkit. [sent-139, score-0.194]

59 In order to check the accuracy and benefit of the local sentiment predictor, we hand-labeled the local sentiments of each of these reviews. [sent-140, score-1.08]

60 1 Sentence Level Prediction To evaluate the prediction quality of the local sentiment we compared the performance of naive Bayes, SVM (using the default parameters of SVMlight ), CRF and isotonic CRF. [sent-143, score-1.198]

61 Figure 2 displays the testing accuracy and distance of predicting the sentiment of sentences as a function of the training data size, averaged over 20 cross-validation train-test splits. [sent-144, score-1.089]

62 The dataset presents one particular difficulty where more than 75% of the sentences are labeled objective (or 0). [sent-145, score-0.144]

63 As a result, the prediction accuracy for objective sentences is over-emphasized. [sent-146, score-0.197]

64 To correct for this fact, we report our test-set performance over a balanced (equal number of sentences for different labels) sample of labeled sentences. [sent-147, score-0.141]
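
A small sketch of the balancing step described above: sub-sampling the labeled test sentences so that every sentiment label appears equally often. The sampling scheme and seed are incidental choices, not taken from the paper.

```python
import random
from collections import defaultdict

def balanced_sample(sentences, labels, seed=0):
    """Return an evaluation subset with the same number of sentences per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s, y in zip(sentences, labels):
        by_label[y].append(s)
    n = min(len(group) for group in by_label.values())
    sample = []
    for y, group in by_label.items():
        for s in rng.sample(group, n):
            sample.append((s, y))
    rng.shuffle(sample)
    return sample
```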

65 Figure 2 legend (axis ticks omitted): isotonic CRFs, CRFs, SVM, naive Bayes. [sent-156, score-0.43]

66 Figure 2: Local sentiment prediction: balanced test results for naive Bayes, SVM, CRF and iso-CRF. [sent-165, score-0.947]

67 As described in Section 3, for isotonic CRF, we obtained 300 words to enforce monotonicity constraints. [sent-166, score-0.287]

68 The 150 words that achieved the highest correlation with the sentiment were chosen for positivity constraints. [sent-167, score-0.943]

69 This criterion is based on the observation that in sentiment prediction, the cost of misprediction is influenced by the ordinal relation on the labels, rather than the 0-1 error rate. [sent-173, score-0.999]

70 2 Global Sentiment Prediction We also evaluated the contribution of the local sentiment analysis in helping to predict the global sentiment of documents. [sent-175, score-1.878]

71 We compared a nearest neighbor classifier for the global sentiment, where the representation varied from bag of words to smoothed length-normalized local sentiment representation (with and without objective sentences). [sent-176, score-1.12]

72 Figure 3 displays discrete and smoothed local sentiment labels, and the smoothed sentiment flow predicted by isotonic CRF. [sent-179, score-2.136]

73 Figure 4 and Table 2 display test-set accuracy of global sentiments as a function of the train set size. [sent-180, score-0.123]

74 The distance in the nearest neighbor classifier was either L1 or L2 for the bag of words representation or their continuous version (7) for the smoothed sentiment curve representation. [sent-181, score-1.058]

75 The results indicate that the classification performance of the local sentiment representation is better than the bag of words representation. [sent-182, score-1.009]

76 In accordance with the conclusion of [6], removing objective sentences (that correspond to sentiment 0) increased the local sentiment analysis performance by 20. [sent-183, score-1.996]

77 We can thus conclude that for the purpose of global sentiment prediction, local sentiment flow of the nonobjective sentences holds most of the relevant information. [sent-185, score-2.011]

78 Performing local sentiment analysis on non-objective sentences improves performance as the model estimates possess lower variance. [sent-186, score-1.081]

79 3 Measuring the rate of sentiment change We examine the rate of sentiment change as a characterization of the author’s writing style using the isotonic author-dependent model of Section 3. [sent-188, score-2.043]

80 We assume that the CRF process is a discrete sampling of an underlying continuous-time Markov jump process. (Figure 3 legend entries: labels, curve rep of labels, predicted, curve rep; axis ticks omitted.) [sent-190, score-0.135]

81 The blue circles indicate the labeled sentiment of each sentence. [sent-200, score-0.907]

82 The blue solid curve and red dashed curve are smoothed representations of the labeled and predicted sentiment flows. [sent-201, score-0.999]

83 The numberings correspond to sentences displayed in Section 4. [sent-203, score-0.122]

84 Figure 4 legend: sentiment flow w/o objective, sentiment flow w/ objective, vocabulary. [sent-207, score-3.872]

85 Figure 4: Accuracy of global sentiment prediction (4-class labeling) as a function of train set size. [sent-218, score-0.971]

86 A consequence of this assumption is that the time T the author stays in sentiment σ before leaving is modeled by the exponential distribution $p_\sigma(T > t) = e^{-q_\sigma (t-1)}$, $t > 1$. [sent-220, score-0.961]
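
Under $p_\sigma(T > t) = e^{-q_\sigma(t-1)}$ with $T > 1$, the excess stay time $T - 1$ is exponential with rate $q_\sigma$, so a natural point estimate is $\hat{q}_\sigma = 1/(\text{mean}(T) - 1)$. The sketch below computes this from observed run lengths in labeled (or predicted) sentiment sequences; note that the paper itself derives $p_\sigma(T > t)$ from the model's marginal probabilities rather than from raw run counts.

```python
import numpy as np

def stay_lengths(label_seq, sigma):
    """Lengths of maximal runs of sentiment sigma in a sentence-level label sequence."""
    runs, current = [], 0
    for y in label_seq:
        if y == sigma:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def estimate_rate(label_seqs, sigma):
    """Point estimate of q_sigma under p_sigma(T > t) = exp(-q_sigma (t - 1)), i.e. T - 1 ~ Exp(q_sigma)."""
    runs = [t for seq in label_seqs for t in stay_lengths(seq, sigma)]
    excess = np.mean(runs) - 1.0
    return np.inf if excess <= 0 else 1.0 / excess

reviews = [[0, 1, 1, 2, 0, 0, -1], [1, 1, 1, 0, -1, -1]]
print(estimate_rate(reviews, 1))
```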

87 Here, we assume T > 1 and qσ is interpreted as the rate of change of the sentiment σ ∈ O: the larger the value, the more likely the author will switch to other sentiments in the near future. [sent-221, score-1.043]

88 To estimate the rate of change qσ of an author we need to compute pσ (T > t) based on the marginal probabilities p(s|a) of sentiment sequences s of length l. [sent-222, score-0.961]

89 The data was the 249 movie reviews from the previous experiments, written by one author, and an additional 201 movie reviews from a second author. [sent-228, score-0.122]

90 Interestingly, the author associated with the red dashed line has a consistently lower qσ value in all those figures, and thus is considered more "static" and less prone to quick sentiment variations. [sent-229, score-0.973]

91 Table 2 row labels: vocabulary, sentiment flow with objective sentences, sentiment flow without objective sentences (L1 distance values omitted). [sent-230, score-2.12]

92 4 Text Summarization We demonstrate the potential usage of sentiment flow for text summarization with a very simple example. [sent-252, score-0.989]

93 The text below shows the result of summarizing the movie review in Figure 3 by keeping only sentences associated with the start, the end, the top, and the bottom of the predicted sentiment curve. [sent-253, score-1.108]
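
An illustrative sketch of the extraction rule just described: keep the sentences at the start, the end, the top, and the bottom of the predicted sentiment curve. Returning the picks in document order and deduplicating coinciding indices are assumptions.

```python
import numpy as np

def summarize_by_flow(sentences, predicted_flow):
    """Keep sentences at the start, end, top and bottom of the predicted sentiment curve
    (predicted_flow holds one curve value per sentence)."""
    flow = np.asarray(predicted_flow, dtype=float)
    picks = {0, len(sentences) - 1, int(np.argmax(flow)), int(np.argmin(flow))}
    return [sentences[i] for i in sorted(picks)]

sents = ["A promising start.", "The middle drags.", "The ending is dreadful.", "Still, I laughed once."]
print(summarize_by_flow(sents, [1.0, 0.2, -1.8, -0.5]))
```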

94 Alternative schemes for extracting specific sentences may be used to achieve different effects, depending on the needs of the user. [sent-260, score-0.122]

95 We plan to experiment further in this area by combining local sentiment flow and standard summarization techniques. [sent-261, score-1.006]

96 5 Discussion In this paper, we address the prediction and application of the local sentiment flow concept. [sent-262, score-0.983]

97 As existing models are inadequate for a variety of reasons, we introduce the isotonic CRF model that is suited to predict the local sentiment flow. [sent-263, score-1.139]

98 We also demonstrate the usefulness of the local sentiment representation for global sentiment prediction, style analysis and text summarization. [sent-265, score-1.921]

99 Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. [sent-276, score-0.921]

100 Statistical inference under order restrictions; the theory and application of isotonic regression. [sent-293, score-0.194]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sentiment', 0.907), ('isotonic', 0.194), ('crf', 0.161), ('sentences', 0.122), ('yi', 0.115), ('ordinal', 0.092), ('sentiments', 0.082), ('ow', 0.081), ('summarization', 0.061), ('author', 0.054), ('sentence', 0.052), ('document', 0.044), ('monotonicity', 0.039), ('prediction', 0.038), ('local', 0.038), ('words', 0.036), ('pang', 0.032), ('yj', 0.031), ('movie', 0.031), ('smoothed', 0.031), ('bius', 0.031), ('ows', 0.03), ('flow', 0.03), ('predicting', 0.03), ('documents', 0.029), ('crfs', 0.028), ('bag', 0.028), ('sj', 0.028), ('proposition', 0.027), ('opinion', 0.027), ('global', 0.026), ('xi', 0.025), ('reviews', 0.025), ('characters', 0.024), ('tj', 0.024), ('curve', 0.024), ('restriction', 0.023), ('objective', 0.022), ('sequence', 0.022), ('style', 0.022), ('ordered', 0.021), ('naive', 0.021), ('text', 0.021), ('conditional', 0.021), ('conveying', 0.02), ('lafayette', 0.02), ('poset', 0.02), ('lm', 0.02), ('ordering', 0.019), ('neighbor', 0.019), ('labeling', 0.019), ('labels', 0.019), ('categorical', 0.019), ('balanced', 0.019), ('elds', 0.018), ('enforce', 0.018), ('vocabulary', 0.018), ('rep', 0.018), ('purdue', 0.018), ('emotionally', 0.018), ('classifier', 0.018), ('lebanon', 0.018), ('something', 0.017), ('word', 0.017), ('polarity', 0.016), ('accuracy', 0.015), ('guessing', 0.015), ('lives', 0.015), ('displays', 0.015), ('possess', 0.014), ('magazine', 0.014), ('opinions', 0.014), ('enforcing', 0.014), ('stars', 0.014), ('review', 0.014), ('topic', 0.014), ('categorization', 0.014), ('annotation', 0.014), ('svm', 0.013), ('bayes', 0.013), ('nearest', 0.013), ('smoothing', 0.013), ('ez', 0.013), ('piecewise', 0.013), ('predicted', 0.013), ('examine', 0.013), ('prone', 0.012), ('west', 0.012), ('cliques', 0.012), ('restrictions', 0.012), ('lost', 0.012), ('regression', 0.012), ('sequential', 0.012), ('smooth', 0.011), ('motivates', 0.011), ('graphical', 0.011), ('purpose', 0.011), ('written', 0.01), ('authors', 0.01), ('strongly', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

Author: Yi Mao, Guy Lebanon

Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is well-suited to handle this problem. Using the Möbius transform, we express the model as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization.

2 0.084399126 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan

Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1

3 0.075806528 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1

4 0.075295039 156 nips-2006-Ordinal Regression by Extended Binary Classification

Author: Ling Li, Hsuan-tien Lin

Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1

5 0.054512899 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo

Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1

6 0.045866352 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

7 0.034081221 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

8 0.033536438 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

9 0.033430226 186 nips-2006-Support Vector Machines on a Budget

10 0.03247463 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

11 0.030997217 122 nips-2006-Learning to parse images of articulated bodies

12 0.029991031 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation

13 0.029050509 130 nips-2006-Max-margin classification of incomplete data

14 0.028944125 20 nips-2006-Active learning for misspecified generalized linear models

15 0.028109049 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations

16 0.026592091 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

17 0.025963442 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

18 0.02578458 166 nips-2006-Recursive Attribute Factoring

19 0.025112189 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

20 0.024942433 66 nips-2006-Detecting Humans via Their Pose


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.088), (1, 0.025), (2, 0.012), (3, -0.013), (4, 0.011), (5, 0.054), (6, -0.029), (7, -0.009), (8, -0.028), (9, -0.032), (10, -0.101), (11, -0.039), (12, -0.058), (13, 0.042), (14, -0.038), (15, 0.072), (16, 0.025), (17, -0.06), (18, 0.009), (19, -0.045), (20, 0.045), (21, -0.049), (22, -0.002), (23, -0.016), (24, 0.017), (25, -0.043), (26, -0.032), (27, -0.036), (28, 0.042), (29, -0.032), (30, -0.067), (31, -0.06), (32, -0.082), (33, -0.053), (34, 0.055), (35, -0.011), (36, -0.01), (37, -0.024), (38, -0.103), (39, 0.014), (40, 0.001), (41, 0.006), (42, -0.084), (43, 0.048), (44, 0.085), (45, -0.083), (46, -0.014), (47, -0.033), (48, -0.094), (49, -0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90490091 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

Author: Yi Mao, Guy Lebanon

Abstract: We examine the problem of predicting local sentiment flow in documents, and its application to several areas of text analysis. Formally, the problem is stated as predicting an ordinal sequence based on a sequence of word sets. In the spirit of isotonic regression, we develop a variant of conditional random fields that is wellsuited to handle this problem. Using the M¨ bius transform, we express the model o as a simple convex optimization problem. Experiments demonstrate the model and its applications to sentiment prediction, style analysis, and text summarization. 1

2 0.6237421 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

Author: Jade P. Vinson, David Decaprio, Matthew D. Pearson, Stacey Luoma, James E. Galagan

Abstract: Computational gene prediction using generative models has reached a plateau, with several groups converging to a generalized hidden Markov model (GHMM) incorporating phylogenetic models of nucleotide sequence evolution. Further improvements in gene calling accuracy are likely to come through new methods that incorporate additional data, both comparative and species specific. Conditional Random Fields (CRFs), which directly model the conditional probability P (y|x) of a vector of hidden states conditioned on a set of observations, provide a unified framework for combining probabilistic and non-probabilistic information and have been shown to outperform HMMs on sequence labeling tasks in natural language processing. We describe the use of CRFs for comparative gene prediction. We implement a model that encapsulates both a phylogenetic-GHMM (our baseline comparative model) and additional non-probabilistic features. We tested our model on the genome sequence of the fungal human pathogen Cryptococcus neoformans. Our baseline comparative model displays accuracy comparable to the the best available gene prediction tool for this organism. Moreover, we show that discriminative training and the incorporation of non-probabilistic evidence significantly improve performance. Our software implementation, Conrad, is freely available with an open source license at http://www.broad.mit.edu/annotation/conrad/. 1

3 0.60834545 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1

4 0.45913237 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

Author: Chi-hoon Lee, Shaojun Wang, Feng Jiao, Dale Schuurmans, Russell Greiner

Abstract: We present a novel, semi-supervised approach to training discriminative random fields (DRFs) that efficiently exploits labeled and unlabeled training data to achieve improved accuracy in a variety of image processing tasks. We formulate DRF training as a form of MAP estimation that combines conditional loglikelihood on labeled data, given a data-dependent prior, with a conditional entropy regularizer defined on unlabeled data. Although the training objective is no longer concave, we develop an efficient local optimization procedure that produces classifiers that are more accurate than ones based on standard supervised DRF training. We then apply our semi-supervised approach to train DRFs to segment both synthetic and real data sets, and demonstrate significant improvements over supervised DRFs in each case.

5 0.42626324 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers

Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the specific word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. 
The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo. Word-based retrieval techniques, such as the widely-used term-frequency inverse-documentfrequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and specificity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstraction and specific word matching. Daum´ and Marcu (2006) independently proposed a probabilistic e model using similar concepts for handling different training and test distributions in classification problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of specific individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. 
Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and concluding comments. 2 A Topic Model for Special Words Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topicword multinomials represented by φ. In the generative model, for each document d, the Nd words (a) β2 β1 α γ Ω ψ θ λ β0 z x φ w (b) α θ β z φ w Nd T T D Nd D Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right). are generated by drawing a topic t from the document-topic distribution p(z|θd ) and then drawing a word w from the topic-word distribution p(w|z = t, φt ). As shown in Griffiths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efficiently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples. We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution specific for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-specific multinomial Ψ or a corpus specific multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2 ) respectively. x is sampled from a document-specific multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors. The conditional probability of a word w given a document d can be written as: T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w) p(w|d) = p(x = 0|d) t=1 ′ where p (w|d) is the special word distribution for document d, and p′′ (w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways, via topics, via a special word distribution, or via a background word distribution. 
Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi , for xi = 0: p (xi = 0, zi = t |w, x−i , z−i , α, β0 , γ ) ∝ Nd0,−i + γ × Nd,−i + 3γ TD Ctd,−i + α t′ TD Ct′ d,−i + T α × WT Cwt,−i + β0 WT w ′ Cw ′ t,−i + W β0 and for xi = 1: p (xi = 1 |w, x−i , z−i , β1 , γ ) ∝ Nd1,−i + γ × Nd,−i + 3γ WD Cwd,−i + β1 w′ WD Cw′ d,−i + W β1 e mail krugman nytimes com memo to critics of the media s liberal bias the pinkos you really should be going after are those business reporters even i was startled by the tone of the jan 21 issue of investment news which describes itself as the weekly newspaper for financial advisers the headline was paul o neill s sweet deal the blurb was irs backs off closing loophole averting tax liability for execs and treasury chief it s not really news that the bush administration likes tax breaks for businessmen but two weeks later i learned from the wall street journal that this loophole is more than a tax break for businessmen it s a gift to biznesmen and it may be part of a larger pattern confused in the former soviet union the term biznesmen pronounced beeznessmen refers to the class of sudden new rich who emerged after the fall of communism and who generally got rich by using their connections to strip away the assets of public enterprises what we ve learned from enron and other players to be named later is that america has its own biznesmen and that we need to watch out for policies that make it easier for them to ply their trade it turns out that the sweet deal investment news was referring to the use of split premium life insurance policies to give executives largely tax free compensation you don t want to know the details is an even sweeter deal for executives of companies that go belly up it shields their wealth from creditors and even from lawsuits sure enough reports the wall street journal former enron c e o s kenneth lay and jeffrey skilling both had large split premium policies so what other pro biznes policies have been promulgated lately last year both houses of … john w snow was paid more than 50 million in salary bonus and stock in his nearly 12 years as chairman of the csx corporation the railroad company during that period the company s profits fell and its stock rose a bit more than half as much as that of the average big company mr snow s compensation amid csx s uneven performance has drawn criticism from union officials and some corporate governance specialists in 2000 for example after the stock had plunged csx decided to reverse a 25 million loan to him the move is likely to get more scrutiny after yesterday s announcement that mr snow has been chosen by president bush to replace paul o neill as the treasury secretary like mr o neill mr snow is an outsider on wall street but an insider in corporate america with long experience running an industrial company some wall street analysts who follow csx said yesterday that mr snow had ably led the company through a difficult period in the railroad industry and would make a good treasury secretary it s an excellent nomination said jill evans an analyst at j p morgan who has a neutral rating on csx stock i think john s a great person for the administration he as the c e o of a railroad has probably touched every sector of the economy union officials are less complimentary of mr snow s performance at csx last year the a f l c i o criticized him and csx for the company s decision to 
reverse the loan allowing him to return stock he had purchased with the borrowed money at a time when independent directors are in demand a corporate governance specialist said recently that mr snow had more business relationships with members of his own board than any other chief executive in addition mr snow is the third highest paid of 37 chief executives of transportation companies said ric marshall chief executive of the corporate library which provides specialized investment research into corporate boards his own compensation levels have been pretty high mr marshall said he could afford to take a public service job a csx program in 1996 allowed mr snow and other top csx executives to buy… Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation. and for xi = 2: p (xi = 2 |w, x−i , z−i , β2 , γ ) ∝ Nd2,−i + γ × Nd,−i + 3γ W Cw,−i + β2 W w ′ Cw ′ ,−i + W β2 where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0 , Nd1 and Nd2 are the number of words in document d assigned to the W W W latent topics, special words and background component, respectively, CwtT , CwdD and Cw are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2 increases. One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufficient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize. We also pursued a variant of SWB, the special words (SW) model that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-specific Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the final word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z. 3 Illustrative Examples We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word “Enron.” The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis. 
Figure 2(a) shows how intentionally misspelled words such as “biznesmen” and “beeznessmen” and rare Collection NIPS PATENTS AP FR # of Docs 1740 6711 10000 2500 Total # of Word Tokens 2,301,375 15,014,099 2,426,181 6,332,681 Median Doc Length 1310 1858 235.5 516 Mean Doc Length 1322.6 2237.2 242.6 2533.1 # of Queries N/A N/A 142 30 Table 1: General characteristics of document data sets used in experiments. NIPS set number results case problem function values paper approach large PATENTS .0206 .0167 .0153 .0123 .0118 .0108 .0102 .0088 .0080 .0079 fig end extend invent view shown claim side posit form .0647 .0372 .0267 .0246 .0214 .0191 .0189 .0177 .0153 .0128 AP tagnum itag requir includ section determin part inform addit applic FR .0416 .0412 .0381 .0207 .0189 .0134 .0112 .0105 .0096 .0086 nation sai presid polici issu call support need govern effort .0147 .0129 .0118 .0108 .0096 .0094 .0085 .0079 .0070 .0068 Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora. words such as “pinkos” are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as “Snow” and the corporation name “CSX” that are specific to the document are likely to be assigned to the special topic. The words “Snow” and “CSX” do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as “special” words. 4 Experimental Results: Perplexity and Precision We use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online1 and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included first and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were fit using T = 200 topics. Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus. The ratio of words assigned to the special words distribution and the background distribution are (respectively for each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short “abstract-like” collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more specific details. 4.1 Perplexity Comparisons The NIPS and PATENTS documents sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents. 
Perplexity of a test set under a model is defined as follows: 1 From http://www.cs.toronto.edu/˜roweis/data.html 1900 550 LDA SW SWB 1800 500 1700 450 LDA SW SWB Perplexity Perplexity 1600 1500 1400 400 350 300 1300 250 1200 200 1100 1000 10 20 30 40 50 60 70 80 150 10 90 20 Percentage of Words Observed 30 40 50 60 70 80 90 Percentage of Words Observed Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right). Perplexity(wtest |Dtrain ) = exp − Dtest d=1 log p(wd |Dtrain ) Dtest d=1 Nd where wtest is a vector of words in the test data set, wd is a vector of words in document d of the test set, and Dtrain is the training set. For the SWB model, we approximate p(wd |Dtrain ) as follows: p(wd |Dtrain ) ≈ 1 S S p(wd |{Θs Φs Ψs Ωs λs }) s=1 where Θs , Φs , Ψs , Ωs and λs are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words wd in a test document d, given its parameters, can be computed as follows: Nd p(wd |{Θs Φs Ψs Ωs λs }) = T s s φs i t θtd + λs ψwi d + λs Ωs i 3d w w 2d λs 1d i=1 t=1 where Nd is the number of words in test document d and wi is the ith word being predicted in the s s test document. θtd , φs i t , ψwi d , Ωs i and λs are point estimates from sample s. w w d When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-specific parameters, θd , ψd and λd and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents. From the perplexity figures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have significantly lower perplexity values than LDA indicating that the SW and SWB models are using the special words “route” to better learn predictive models for individual documents. 4.2 Information Retrieval Results Returning to the point of capturing both specific and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected lowfrequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how Method TF-IDF LSI LDA SW SWB 1 Ret Doc 100.0 97.6 90.0 99.2 99.4 10 Ret Docs 100.0 82.7 80.6 97.1 96.6 50 Ret Docs 100.0 64.6 67.0 79.1 78.7 100 Ret Docs 100.0 54.3 58.7 67.3 67.2 Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus). 
AP MAP Method TF-IDF LSI LDA SW SWB Title .353 .286 .424 .466* .460* Pr@10d Desc .358 .387 .394 .430* .417 Concepts .498 .459 .498 .550* .549* Method TF-IDF LSI LDA SW SWB Title .406 .455 .478 .524* .513* Method TF-IDF LSI LDA SW SWB Title .300 .366 .428 .469 .462 Desc .434 .469 .463 .509* .495 Concepts .549 .523 .556 .599* .603* FR MAP Method TF-IDF LSI LDA SW SWB Title .268 .329 .344 .371 .373 Pr@10d Desc .272 .295 .271 .323* .328* Concepts .391 .399 .396 .448* .435 Desc .287 .327 .340 .407* .423* Concepts .483 .487 .487 .550* .523 *=sig difference wrt LDA Figure 5: Information retrieval experimental results. often specific query words occur in retrieved documents. TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more specific matches than either LDA or LSI, indicating that they have the ability to match at the level of specific words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next. For the AP and FR documents sets, 3 types of query sets were constructed from TREC Topics 1150, based on the T itle (short), Desc (sentence-length) and Concepts (long list of keywords) fields. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w)] [p(x = 0|d) p(q|d) ≈ w∈q t=1 We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TFCW D D wd IDF score for a word w in a document d is computed as TF-IDF(w, d) = Nd × log2 Dw . For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is first mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries. To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric. The maximum score for each query experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical significance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not significant. In figure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically significant difference (always an improvement) relative to the LDA model. The differences for the “non-starred” query and metric scores of SW and SWB are not statistically significant but nonetheless always favor SW and SWB over LDA. 
5 Discussion and Conclusions

Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches.

In conclusion, we have proposed a new probabilistic model that accounts for both general and specific aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account specific aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner specific to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very specific query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize.

Acknowledgements

We thank Tom Griffiths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics.

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.

Buntine, W., Löfström, J., Perttu, S., and Valtonen, K. (2005) Topic-specific scoring of documents for relevant retrieval. Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany.

Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129.

Daumé III, H., and Marcu, D. (2006) Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26: 101-126.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.

Griffiths, T. L., and Steyvers, M. (2004) Finding scientific topics. Proceedings of the National Academy of Sciences, pp. 5228-5235.

Hofmann, T. (1999) Probabilistic latent semantic indexing. Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57.

Vogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151-173.

Wei, X. and Croft, W. B. (2006) LDA-based document models for ad-hoc retrieval. Proceedings of the 29th SIGIR Conference, pp. 178-185.

6 0.40099901 117 nips-2006-Learning on Graph with Laplacian Regularization

7 0.39818141 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

8 0.39747488 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

9 0.39546695 129 nips-2006-Map-Reduce for Machine Learning on Multicore

10 0.38010994 150 nips-2006-On Transductive Regression

11 0.37806726 156 nips-2006-Ordinal Regression by Extended Binary Classification

12 0.36828119 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

13 0.35646871 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

14 0.34202635 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

15 0.33950531 122 nips-2006-Learning to parse images of articulated bodies

16 0.33818087 186 nips-2006-Support Vector Machines on a Budget

17 0.33559182 104 nips-2006-Large-Scale Sparsified Manifold Regularization

18 0.32784331 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

19 0.30358034 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

20 0.30322182 66 nips-2006-Detecting Humans via Their Pose


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.097), (3, 0.017), (7, 0.076), (9, 0.035), (22, 0.061), (44, 0.07), (57, 0.066), (65, 0.061), (69, 0.03), (75, 0.256), (90, 0.053), (93, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76926053 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow


2 0.68184924 185 nips-2006-Subordinate class recognition using relational object models

Author: Aharon B. Hillel, Daphna Weinshall

Abstract: We address the problem of sub-ordinate class recognition, like the distinction between different types of motorcycles. Our approach is motivated by observations from cognitive psychology, which identify parts as the defining component of basic level categories (like motorcycles), while sub-ordinate categories are more often defined by part properties (like ’jagged wheels’). Accordingly, we suggest a two-stage algorithm: First, a relational part based object model is learnt using unsegmented object images from the inclusive class (e.g., motorcycles in general). The model is then used to build a class-specific vector representation for images, where each entry corresponds to a model’s part. In the second stage we train a standard discriminative classifier to classify subclass instances (e.g., cross motorcycles) based on the class-specific vector representation. We describe extensive experimental results with several subclasses. The proposed algorithm typically gives better results than a competing one-step algorithm, or a two stage algorithm where classification is based on a model of the sub-ordinate class. 1

3 0.56451535 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1

4 0.56439263 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

5 0.56198573 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to some of the problems on which IC has been demonstrated. 1

6 0.56052482 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

7 0.55991739 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

8 0.55959725 117 nips-2006-Learning on Graph with Laplacian Regularization

9 0.55850369 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

10 0.55800951 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

11 0.55673623 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

12 0.55615401 175 nips-2006-Simplifying Mixture Models through Function Approximation

13 0.55563658 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

14 0.55395079 61 nips-2006-Convex Repeated Games and Fenchel Duality

15 0.55346835 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

16 0.5530256 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

17 0.5529983 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

18 0.55264521 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

19 0.55177879 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

20 0.54978973 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields