nips nips2002 nips2002-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. [sent-4, score-0.522]
2 We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. [sent-5, score-0.505]
3 1 Introduction Many cognitive capacities, such as memory and categorization, can be analyzed as systems for efficiently predicting aspects of an organism's environment [1]. [sent-6, score-0.168]
4 Previously, such analyses have been concerned with memory for facts or the properties of objects, where the prediction task involves identifying when those facts might be needed again, or what properties novel objects might possess. [sent-7, score-0.086]
5 However, one of the most challenging tasks people face is linguistic communication. [sent-8, score-0.104]
6 Engaging in conversation or reading a passage of text requires retrieval of a variety of concepts from memory in response to a stream of information. [sent-9, score-0.158]
7 This retrieval task can be facilitated by predicting which concepts are likely to be needed from their context, having efficiently abstracted and stored the cues that support these predictions. [sent-10, score-0.161]
8 In this paper, we examine how understanding the problem of predicting words from their context can provide insight into human semantic association, exploring the hypothesis that the association between words is at least partially affected by their statistical relationships. [sent-11, score-1.21]
9 Several researchers have argued that semantic association can be captured using high-dimensional spatial representations , with the most prominent such approach being Latent Semantic Analysis (LSA) [5]. [sent-12, score-0.47]
10 We will describe this procedure, which indirectly addresses the prediction problem. [sent-13, score-0.096]
11 We will then suggest an alternative approach which explicitly models the way language is generated and show that this approach provides a better account of human word association data than LSA, although the two approaches are closely related. [sent-14, score-0.77]
12 The great promise of this approach is that it illustrates how we might begin to relax some of the strong assumptions about language made by many corpus-based methods. [sent-15, score-0.089]
13 2 Latent Semantic Analysis Latent Semantic Analysis addresses the prediction problem by capturing similarity in word usage: seeing a word suggests that we should expect to see other words with similar usage patterns. [sent-17, score-1.217]
14 Given a corpus containing W words and D documents, the input to LSA is a W × D word-document co-occurrence matrix F in which f_wd corresponds to the frequency with which word w occurred in document d. [sent-18, score-0.961]
15 This matrix is transformed to a matrix G via some function involving the term frequency f_wd and its frequency across documents f_w. [sent-19, score-0.414]
16 g_wd = log(f_wd + 1) (1 - H_w), H_w = -Σ_d P(d|w) log P(d|w) / log D, (1) where H_w is the normalized entropy of the distribution over documents for each word. [sent-22, score-0.116]
17 Singular value decomposition (SVD) is applied to G to extract a lower dimensional linear subspace that captures much of the variation in usage across words. [sent-23, score-0.07]
18 The association between two words is typically assessed using the cosine of the angle between their vectors, a measure that appears to produce psychologically accurate results on a variety of tasks [5] . [sent-25, score-0.62]
19 Our subset used all D = 37651 documents, and the W = 26414 words that occurred at least ten times in the whole corpus, with stop words removed. [sent-27, score-0.488]
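As a concrete illustration of this pipeline, the sketch below builds the entropy-weighted matrix G, extracts a low-dimensional space by truncated SVD, and measures association by the cosine. It is not code from the paper: the toy random count matrix, the choice of k, and the standard log-entropy form assumed for Equation 1 are all assumptions made here for concreteness.

```python
import numpy as np

# Toy word-document count matrix F (W x D); the paper's TASA subset is
# 26414 words by 37651 documents, far too large to reproduce here.
rng = np.random.default_rng(0)
F = rng.poisson(0.5, size=(200, 50)).astype(float)

W, D = F.shape
f_w = F.sum(axis=1, keepdims=True) + 1e-12              # total frequency of each word
P_dw = F / f_w                                          # P(d | w) for each word
with np.errstate(divide="ignore", invalid="ignore"):
    ent = np.where(P_dw > 0, P_dw * np.log(P_dw), 0.0)
H_w = -ent.sum(axis=1) / np.log(D)                      # normalized entropy per word

G = np.log(F + 1) * (1 - H_w)[:, None]                  # entropy-weighted matrix G (assumed form of Eq. 1)

k = 20                                                  # the paper uses around 500 dimensions
U, S, Vt = np.linalg.svd(G, full_matrices=False)
X = U[:, :k] * S[:k]                                    # word coordinates X = U D

def cosine(i, j):
    """Association between words i and j: cosine of the angle between their vectors."""
    return X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]) + 1e-12)
```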
20 3 The topic model Latent Semantic Analysis gives results that seem consistent with human judgments and extracts information relevant to predicting words from their contexts, although it was not explicitly designed with prediction in mind. [sent-29, score-0.848]
21 This relationship suggests that a closer correspondence to human data might be obtained by directly attempting to solve the prediction task. [sent-30, score-0.229]
22 One generative model that has been used to outperform LSA on information retrieval tasks views documents as being composed of sets of topics [2,4]. [sent-32, score-0.504]
23 The words likely to be used in a new context can be determined by estimating the distribution over topics for that context, corresponding to P(Zi). [sent-34, score-0.61]
24 Intuitively, P(w|z = j) indicates which words are important to a topic, while P(z) is the prevalence of those topics within a document. [sent-35, score-0.545]
25 For example, imagine a world where the only topics of conversation are love and research. [sent-36, score-0.469]
26 LSA performed best on the word association task with around 500 dimensions, so we used the same dimensionality for the topic model. [sent-38, score-0.96]
27 We could then express the probability distribution over words with two topics, one relating to love and the other to research. [sent-39, score-0.385]
28 The content of the topics would be reflected in P(w|z = j): the love topic would give high probability to words like JOY, PLEASURE, or HEART, while the research topic would give high probability to words like SCIENCE, MATHEMATICS, or EXPERIMENT. [sent-40, score-1.639]
29 Whether a particular conversation concerns love, research, or the love of research would depend upon its distribution over topics, P(z), which determines how these topics are mixed together in forming documents. [sent-41, score-0.498]
30 Having defined a generative model, learning topics becomes a statistical problem. [sent-42, score-0.351]
31 Suppose we observe a corpus w = {w_1, ..., w_n}, where each w_i belongs to some document d_i, as in a word-document co-occurrence matrix. [sent-46, score-0.097]
32 For each document we have a multinomial distribution over the T topics, with parameters θ^(d), so for a word in document d, P(z_i = j) = θ_j^(d). [sent-47, score-0.653]
33 The jth topic is represented by a multinomial distribution over the W words in the vocabulary, with parameters φ^(j), so P(w_i | z_i = j) = φ_{w_i}^(j). [sent-48, score-0.716]
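A minimal sketch of this generative process, with toy sizes and hypothetical Dirichlet hyperparameters chosen only for illustration (the paper itself uses T = 500 topics on the TASA subset):

```python
import numpy as np

rng = np.random.default_rng(1)

W, T, D = 1000, 5, 20        # vocabulary size, number of topics, number of documents (toy values)
alpha, beta = 0.1, 0.01      # hypothetical symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(W, beta), size=T)     # phi[j, w] = P(w | z = j)
theta = rng.dirichlet(np.full(T, alpha), size=D)  # theta[d, j] = P(z = j) for document d

def generate_document(d, length=100):
    """Draw a topic z_i from theta[d] for each position, then a word from phi[z_i]."""
    z = rng.choice(T, size=length, p=theta[d])
    w = np.array([rng.choice(W, p=phi[j]) for j in z])
    return w, z

words, topics = generate_document(0)
```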
34 Here, we present a novel approach to inference in this model, using Markov chain Monte Carlo with a symmetric Dirichlet(α) prior on θ^(d_i) for all documents and a symmetric Dirichlet(β) prior on φ^(j) for all topics. [sent-51, score-0.131]
35 We use Gibbs sampling, where each state is an assignment of values to the variables being sampled, and the next state is reached by sequentially sampling all variables from their distribution when conditioned on the current values of all other variables and the data. [sent-53, score-0.119]
36 We will sample only the assignments of words to topics, Zi. [sent-54, score-0.244]
37 P(z_i = j | z_-i, w) ∝ (n_{-i,j}^(w_i) + β) / (n_{-i,j}^(·) + Wβ) · (n_{-i,j}^(d_i) + α) / (n_{-i,·}^(d_i) + Tα), (3) where n_{-i,j}^(w) is the number of words assigned to topic j that are the same as w, n_{-i,j}^(·) is the total number of words assigned to topic j, n_{-i,j}^(d) is the number of words from document d assigned to topic j, and n_{-i,·}^(d) [sent-58, score-1.936]
38 is the total number of words in document d, all not counting the assignment of the current word w_i. [sent-59, score-0.766]
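This conditional distribution can be sampled with a collapsed Gibbs sweep over the token-topic assignments. The sketch below is one plausible implementation, not the authors' code; note that the document-length term n_{-i,·}^(d) + Tα is constant across topics for a given token and can be dropped from the unnormalized probabilities.

```python
import numpy as np

def gibbs_sweep(words, docs, z, n_wt, n_t, n_dt, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling over the topic assignments z_i.

    words[i] and docs[i] give the word type and document of token i; n_wt[w, j]
    counts tokens of type w assigned to topic j, n_t[j] is its column sum, and
    n_dt[d, j] counts tokens of document d assigned to topic j.
    """
    W, T = n_wt.shape
    for i in range(len(words)):
        w, d, j = words[i], docs[i], z[i]
        # Remove token i from all counts ("not counting the current word").
        n_wt[w, j] -= 1; n_t[j] -= 1; n_dt[d, j] -= 1
        # Unnormalized conditional P(z_i = j | z_-i, w), as in Eq. 3.
        p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
        p = p / p.sum()
        j = rng.choice(T, p=p)
        z[i] = j
        n_wt[w, j] += 1; n_t[j] += 1; n_dt[d, j] += 1
    return z
```

After burn-in, the topic-word distributions can be read off the counts, e.g. φ_w^(j) ≈ (n_wt[w, j] + β) / (n_t[j] + Wβ), which is what a table of topics like Table 1 would be built from.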
39 We applied this algorithm to our subset of the TASA corpus, which contains n = 5628867 word tokens. [sent-61, score-0.385]
40 Each sample consists of an assignment of every word token to a topic, giving a value to each z_i. [sent-65, score-0.425]
41 A subset of the 500 topics found in a single sample are shown in Table 1. [sent-66, score-0.301]
42 φ_w^(j) = (n_j^(w) + β) / (n_j^(·) + Wβ) (4) 4 Predicting word association We used both LSA and the topic model to predict the association between pairs of words, comparing these results with human word association norms collected by Nelson, McEvoy and Schreiber [7]. [sent-73, score-1.958]
43 These word association norms were established by presenting a large number of participants with a cue word and asking them to name an associated word in response. [sent-74, score-1.509]
44 A total of 4544 of the words in these norms appear in the set of 26414 taken from the TASA corpus. [sent-75, score-0.363]
45 1 Latent Semantic Analysis In LSA, the association between two words is usually measured using the cosine of the angle between their vectors. [sent-77, score-0.564]
46 We ordered the associates of each word in the norms by their frequencies , making the first associate the word most commonly given as a response to the cue. [sent-78, score-1.09]
47 For example, the first associate of NEURON is BRAIN. [sent-79, score-0.103]
48 We evaluated the cosine between each word and the other 4543 words in the norms, and then computed the rank of the cosine of each of the first ten associates, or of all of the associates for words with fewer than ten. [sent-80, score-1.393]
49 Small ranks indicate better performance, with a rank of one meaning that the target word had the highest cosine. [sent-82, score-0.503]
50 The median rank of the first associate was 32, and LSA correctly predicted the first associate for 507 of the 4544 words. [sent-83, score-0.318]
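The ranking procedure used for this evaluation can be sketched as follows; the dictionary format assumed for the norms and the word_index mapping are illustrative choices, not the authors' data structures.

```python
import numpy as np

def associate_ranks(vectors, norms, word_index):
    """Rank every candidate word by cosine with the cue; return the ranks of the
    first ten associates for each cue (rank 1 = highest cosine).

    `vectors` is a word-by-dimension matrix (e.g. the LSA space), `norms` maps a
    cue word to its associates ordered by response frequency, and `word_index`
    maps words to row indices.
    """
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    ranks = {}
    for cue, associates in norms.items():
        c = word_index[cue]
        cos = unit @ unit[c]
        cos[c] = -np.inf                      # the cue itself is not a candidate
        order = np.argsort(-cos)              # best candidates first
        position = {row: r + 1 for r, row in enumerate(order)}
        ranks[cue] = [position[word_index[a]] for a in associates[:10] if a in word_index]
    return ranks

# The summary statistics in the text would then be, e.g.:
# np.median([r[0] for r in ranks.values() if r])   # median rank of the first associate
```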
51 2 The topic model The probabilistic nature of the topic model makes it easy to predict the words likely to occur in a particular context. [sent-85, score-0.982]
52 If we have seen word w1 in a document, then we can determine the probability that word w2 occurs in that document by computing P(w2|w1). [sent-86, score-0.867]
53 Allowing a document to mix several topics is extremely important to capturing the complexity of large collections of words and computing the probability of complete documents. [sent-89, score-0.278]
54 However, when comparing individual words it is more effective to assume that they both come from a single topic. [sent-90, score-0.244]
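Under this single-topic assumption, a natural way to compute the conditional is to infer the topic responsible for w1 and then predict w2 from it, i.e. P_1(w2|w1) = Σ_j P(w2|z = j) P(z = j|w1) with P(z = j|w1) ∝ P(w1|z = j) P(z = j). The sketch below follows that reading of the text and is not the paper's code; phi and pz are assumed to come from the Gibbs samples.

```python
import numpy as np

def p1(phi, pz, w1):
    """P_1(w2 | w1) for every candidate w2, under the single-topic assumption.

    phi[j, w] is the estimated P(w | z = j) and pz[j] is P(z = j).
    The topic is inferred from w1 and then used to predict w2.
    """
    p_z_given_w1 = phi[:, w1] * pz
    p_z_given_w1 /= p_z_given_w1.sum()
    return phi.T @ p_z_given_w1          # vector of length W over candidate words w2
```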
55 We computed P_1(w2|w1) for the 4544 words in the norms, and then assessed the rank of the associates in the resulting distribution using the same procedure as for LSA. [sent-93, score-0.444]
56 The median rank for the first associate was 32, with 585 of the 4544 first associates exactly correct. [sent-95, score-0.314]
57 The probabilistic model performed better than LSA, with the improved performance becoming more apparent for the later associates . [sent-96, score-0.098]
58 3 Discussion The central problem in modeling semantic association is capturing the interaction between word frequency and similarity of word usage. [sent-98, score-1.395]
59 Word frequency is an important factor in a variety of cognitive tasks, and one reason for its importance is its predictive utility. [sent-99, score-0.122]
60 A higher observed frequency means that a word should be predicted to occur more often. [sent-100, score-0.503]
61 However, this effect of frequency should be tempered by the relationship between a word and its semantic context . [sent-101, score-0.776]
62 The success of the topic model is a consequence of naturally combining frequency information with semantic similarity: when a word is very diagnostic of a small number of topics, semantic context is used in prediction. [sent-102, score-1.409]
63 The effect of word frequency in the topic model can be seen in the rank-order correlation of the predicted ranks of the first associates with the ranks predicted by word frequency alone , which is p = 0. [sent-104, score-1.643]
64 In contrast, the cosine is used in LSA because it explicitly removes the effect of word frequency, with the corresponding correlation being p = -0. [sent-106, score-0.533]
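A rank-order correlation of this kind could be computed as in the sketch below; the argument names and data structures are assumptions for illustration, not the authors' analysis code.

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_with_frequency(model_ranks, first_associates, word_counts, vocab):
    """Spearman correlation between a model's rank of each cue's first associate
    and the rank that associate receives when candidates are ordered by corpus
    frequency alone.
    """
    freq_order = sorted(vocab, key=lambda w: -word_counts[w])
    freq_rank = {w: r + 1 for r, w in enumerate(freq_order)}
    freq_ranks = [freq_rank[a] for a in first_associates]
    rho, _ = spearmanr(model_ranks, freq_ranks)
    return rho
```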
65 The cosine is purely a measure of semantic similarity, which is useful in situations where word frequency is misleading, such as in tests of English fluency or other linguistic tasks, but not necessarily consistent with human performance. [sent-108, score-0.977]
66 This choice of measure stems from the origins of LSA in information retrieval, but other measures that do incorporate word frequency have been used for modeling psychological data. [sent-109, score-0.512]
67 5 Relating LSA and the topic model The decomposition of a word-document co-occurrence matrix provided by the topic model can be written in a matrix form similar to that of LSA. [sent-111, score-0.792]
68 Given a worddocument co-occurrence matrix F, we can convert the columns into empirical estimates of the distribution over words in each document by dividing each column by its sum. [sent-112, score-0.452]
69 Calling this matrix P, the topic model approximates it with the non-negative matrix factorization P ≈ φθ, where column j of φ gives φ^(j), and column d of θ gives θ^(d). [sent-113, score-0.481]
70 The single-topic assumption removes the off-diagonal elements, replacing θθ^T with I to give P_1(w1, w2) ∝ φφ^T. [sent-115, score-0.369]
71 The locations of the words along the extracted dimensions are X = UD. [sent-117, score-0.244]
72 If the column sums do not vary extensively, the empirical estimate of the joint distribution over words specified by the entries in G will be approximately P(w1, w2) ∝ GG^T. [sent-118, score-0.302]
73 The properties of the SVD guarantee that XX^T, the matrix of inner products among the word vectors, is the best low-rank approximation to GG^T in terms of squared error. [sent-119, score-0.478]
74 The transformations in Equation 1 are intended to reduce the effects of word frequency in the resulting representation, making XX^T more similar to φφ^T. [sent-120, score-0.476]
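The correspondence can be made concrete in a few lines: scoring candidate associates by the inner product in the LSA space is one matrix product, and the single-topic analogue is another. This is an illustrative sketch under the simplifications described above (e.g. replacing θθ^T with I), not an exact reproduction of either model.

```python
import numpy as np

def lsa_inner_product_scores(G, k, cue):
    """Score candidates for `cue` by the inner product in the LSA space:
    X = U D from the truncated SVD of G, so X @ X.T approximates G @ G.T."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    X = U[:, :k] * S[:k]
    return X @ X[cue]

def single_topic_scores(Phi, cue):
    """The topic-model analogue: row `cue` of Phi @ Phi.T, where column j of Phi
    is P(w | z = j). With theta theta^T replaced by I, this is proportional to
    the joint P_1(w1, w2) described in the text."""
    return Phi @ Phi[cue]
```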
75 We used the inner product between word vectors to predict the word association norms, exactly as for the cosine. [sent-121, score-1.071]
76 The inner product initially shows worse performance than the cosine, with a median rank of 34 for the first associate and 500 exactly correct, but performs better for later associates. [sent-123, score-0.283]
77 The rank-order correlation with the predictions of word frequency for the first associate was p = 0. [sent-124, score-0.579]
78 The rank-order correlation between the ranks given by the inner product and the topic model was p = 0.81, [sent-126, score-0.535]
79 while the cosine and the topic model correlate at p = 0. [sent-127, score-0.483]
80 The inner product and P_1(w2|w1) in the topic model seem to give quite similar results, despite being obtained by very different procedures. [sent-129, score-0.464]
81 This similarity is emphasized by choosing to assess the models with separate ranks for each cue word, since this measure does not discriminate between joint and conditional probabilities. [sent-130, score-0.13]
82 While the inner product is related to the joint probability of w1 and w2, P_1(w2|w1) is a conditional probability and thus allows reasonable comparisons of the probability of w2 across choices of w1, as well as having properties like asymmetry that are exhibited by word association. [sent-131, score-0.507]
83 6 Exploring more complex generative models The topic model, which explicitly addresses the problem of predicting words from their contexts, seems to show a closer correspondence to human word association than LSA. [sent-133, score-1.566]
84 A major consequence of this analysis is the possibility that we may be able to gain insight into some of the associative aspects of human semantic memory by exploring statistical solutions to this prediction problem. [sent-134, score-0.494]
85 In particular, it may be possible to develop more sophisticated generative models of language that can capture some of the important linguistic distinctions that influence our processing of words. [sent-135, score-0.179]
86 One such assumption is the treatment of a document as a "bag of words" , in which sequential information is irrelevant. [sent-137, score-0.097]
87 Semantic information is likely to influence only a small subset of the words used in a particular context, with the majority of the words playing functional syntactic roles that are consistent across contexts. [sent-138, score-0.617]
88 Syntax is just as important as semantics for predicting words, and may be an effective means of deciding if a word is context-dependent. [sent-139, score-0.508]
89 In a preliminary exploration of the consequences of combining syntax and semantics in a generative model for language, we applied a simple model combining the syntactic structure of a hidden Markov model (HMM) with the semantic structure of the topic model. [sent-140, score-0.847]
90 We estimated parameters for this model using Gibbs sampling, integrating out the parameters for both the HMM and the topic model and sampling a state and a topic for each of the 11821091 word tokens in the corpus. [sent-142, score-1.148]
91 Some of the state and topic distributions from a single sample after 1000 iterations are shown in Table 2. [sent-143, score-0.394]
92 The states of the HMM accurately picked out many of the functional classes of English syntax, while the state corresponding to the topic model was used to capture the context-specific distributions over nouns. [sent-144, score-0.394]
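One way to picture the composite model is as an HMM whose syntactic states emit from class-specific word distributions, with a single designated semantic state that emits from the document's topic mixture instead. The sketch below is only a guess at that generative structure for illustration; the sizes, the choice of state 0 as the semantic state, and the Dirichlet parameters are all assumptions, not the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(2)

S, T, W = 10, 5, 1000                                    # HMM states, topics, word types (toy)
start = rng.dirichlet(np.ones(S))                        # initial state distribution
trans = rng.dirichlet(np.ones(S), size=S)                # HMM transition matrix
class_words = rng.dirichlet(np.full(W, 0.01), size=S)    # class-specific word distributions
phi = rng.dirichlet(np.full(W, 0.01), size=T)            # topic-word distributions
theta = rng.dirichlet(np.full(T, 0.1))                   # topic mixture for one document

def generate(length=20):
    """Each token is emitted by an HMM state; the semantic state defers to the topics."""
    s = rng.choice(S, p=start)
    words = []
    for _ in range(length):
        if s == 0:                          # semantic state: draw a topic, then a word
            z = rng.choice(T, p=theta)
            words.append(rng.choice(W, p=phi[z]))
        else:                               # syntactic state: class-specific word
            words.append(rng.choice(W, p=class_words[s]))
        s = rng.choice(S, p=trans[s])
    return words

sample = generate()
```

Clamping theta to a particular topic of interest and calling generate() is the kind of topic-specific phrase generation described below.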
93 This larger number is a result of including low-frequency and stop words. [sent-145, score-0.091]
94 Combining the topic model with the HMM seems to have advantages for both: no function words are absorbed into the topics, and the HMM does not need to deal with the context-specific variation in nouns. [sent-146, score-0.613]
95 The model also seems to do a good job of generating topic-specific text - we can clamp the distribution over topics to pick out those of interest, and then use the model to generate phrases. [sent-147, score-0.359]
96 7 Conclusion Viewing memory and categorization as systems involved in the efficient prediction of an organism's environment can provide insight into these cognitive capacities. [sent-150, score-0.145]
97 Likewise, it is possible to learn about human semantic association by considering the problem of predicting words from their contexts. [sent-151, score-0.868]
98 Latent Semantic Analysis addresses this problem, and provides a good account of human semantic association. [sent-152, score-0.395]
99 Here, we have shown that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language, consistent with the hypothesis that the association between words reflects their probabilistic relationships. [sent-153, score-0.691]
100 The great promise of this approach is the potential to explore how more sophisticated statistical models of language, such as those incorporating both syntax and semantics, might help us understand cognition. [sent-154, score-0.126]
wordName wordTfidf (topN-words)
[('word', 0.385), ('topic', 0.369), ('lsa', 0.302), ('topics', 0.301), ('semantic', 0.264), ('words', 0.244), ('association', 0.206), ('zi', 0.131), ('norms', 0.119), ('cosine', 0.114), ('love', 0.112), ('associates', 0.098), ('document', 0.097), ('frequency', 0.091), ('documents', 0.087), ('tasa', 0.086), ('human', 0.082), ('associate', 0.075), ('syntax', 0.075), ('oot', 0.075), ('latent', 0.073), ('predicting', 0.072), ('ranks', 0.071), ('inner', 0.066), ('fwd', 0.064), ('mersenne', 0.064), ('twister', 0.064), ('wlz', 0.064), ('language', 0.063), ('wi', 0.062), ('hmm', 0.062), ('wl', 0.06), ('conversation', 0.056), ('corpus', 0.053), ('semantics', 0.051), ('generative', 0.05), ('addresses', 0.049), ('rank', 0.047), ('prediction', 0.047), ('multinomial', 0.045), ('pi', 0.044), ('chain', 0.044), ('mcevoy', 0.043), ('organism', 0.043), ('ppt', 0.043), ('wil', 0.043), ('wilzi', 0.043), ('usage', 0.043), ('linguistic', 0.041), ('assignment', 0.04), ('svd', 0.039), ('playing', 0.039), ('memory', 0.039), ('closer', 0.038), ('median', 0.038), ('syntactic', 0.038), ('scientist', 0.037), ('correspondence', 0.037), ('context', 0.036), ('retrieval', 0.036), ('capturing', 0.034), ('exploring', 0.034), ('anyone', 0.034), ('explicitly', 0.034), ('dirichlet', 0.033), ('people', 0.033), ('markov', 0.032), ('phrases', 0.032), ('hw', 0.032), ('cognitive', 0.031), ('similarity', 0.03), ('scientists', 0.03), ('tasks', 0.03), ('column', 0.029), ('distribution', 0.029), ('product', 0.029), ('job', 0.029), ('south', 0.029), ('gg', 0.029), ('emitted', 0.029), ('xx', 0.029), ('cue', 0.029), ('jth', 0.029), ('insight', 0.028), ('first', 0.028), ('matrix', 0.027), ('concepts', 0.027), ('predicted', 0.027), ('across', 0.027), ('promise', 0.026), ('nelson', 0.026), ('assessed', 0.026), ('efficiently', 0.026), ('columns', 0.026), ('carlo', 0.026), ('attempting', 0.025), ('influence', 0.025), ('played', 0.025), ('help', 0.025), ('state', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
2 0.2219605 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
3 0.20169136 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
4 0.13493915 115 nips-2002-Informed Projections
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysus (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets. 1
5 0.11129015 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
Author: Brochu Eric, Nando de Freitas
Abstract: We present a novel, flexible statistical approach for modelling music and text jointly. The approach is based on multi-modal mixture models and maximum a posteriori estimation using EM. The learned models can be used to browse databases with documents containing music and text, to search for music using queries consisting of music and text (lyrics and other contextual information), to annotate text documents with music, and to automatically recommend or identify similar songs.
6 0.10250415 176 nips-2002-Replay, Repair and Consolidation
7 0.099774688 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
8 0.095261283 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
9 0.079671688 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
10 0.072826125 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
11 0.072548024 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures
12 0.07196857 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
13 0.067793913 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
14 0.067350946 83 nips-2002-Extracting Relevant Structures with Side Information
15 0.067044809 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
16 0.065989912 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
17 0.065037318 96 nips-2002-Generalized² Linear² Models
18 0.063723922 40 nips-2002-Bayesian Models of Inductive Generalization
19 0.055666428 150 nips-2002-Multiple Cause Vector Quantization
20 0.051680207 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
topicId topicWeight
[(0, -0.177), (1, -0.037), (2, -0.004), (3, -0.012), (4, -0.248), (5, 0.087), (6, -0.07), (7, -0.156), (8, 0.069), (9, -0.203), (10, -0.204), (11, -0.083), (12, 0.068), (13, -0.092), (14, 0.003), (15, 0.049), (16, 0.021), (17, 0.024), (18, 0.014), (19, -0.02), (20, -0.027), (21, 0.055), (22, 0.063), (23, 0.055), (24, -0.054), (25, 0.018), (26, 0.018), (27, 0.018), (28, 0.009), (29, 0.037), (30, 0.026), (31, 0.046), (32, -0.009), (33, -0.023), (34, 0.011), (35, -0.019), (36, 0.096), (37, -0.098), (38, -0.01), (39, 0.001), (40, -0.04), (41, -0.067), (42, -0.027), (43, -0.013), (44, 0.101), (45, -0.01), (46, -0.064), (47, -0.006), (48, 0.045), (49, -0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.97372383 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
2 0.82483315 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
3 0.65200597 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
Author: Bin Wu, K. Wong, David Bodoff
Abstract: We study an explicit parametric model of documents, queries, and relevancy assessment for Information Retrieval (IR). Mean-field methods are applied to analyze the model and derive efficient practical algorithms to estimate the parameters in the problem. The hyperparameters are estimated by a fast approximate leave-one-out cross-validation procedure based on the cavity method. The algorithm is further evaluated on several benchmark databases by comparing with standard algorithms in IR.
4 0.61703378 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
5 0.60989803 176 nips-2002-Replay, Repair and Consolidation
Author: Szabolcs Káli, Peter Dayan
Abstract: A standard view of memory consolidation is that episodes are stored temporarily in the hippocampus, and are transferred to the neocortex through replay. Various recent experimental challenges to the idea of transfer, particularly for human memory, are forcing its re-evaluation. However, although there is independent neurophysiological evidence for replay, short of transfer, there are few theoretical ideas for what it might be doing. We suggest and demonstrate two important computational roles associated with neocortical indices.
6 0.60025173 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
7 0.56616312 115 nips-2002-Informed Projections
8 0.52582896 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
9 0.51629454 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
10 0.46421188 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
11 0.45255223 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures
12 0.36327115 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
13 0.36255988 150 nips-2002-Multiple Cause Vector Quantization
14 0.35690513 107 nips-2002-Identity Uncertainty and Citation Matching
15 0.31523624 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
16 0.30897644 174 nips-2002-Regularized Greedy Importance Sampling
17 0.29518396 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
18 0.29359958 83 nips-2002-Extracting Relevant Structures with Side Information
19 0.28951702 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
20 0.28221053 40 nips-2002-Bayesian Models of Inductive Generalization
topicId topicWeight
[(11, 0.068), (14, 0.013), (23, 0.024), (32, 0.214), (42, 0.082), (54, 0.093), (55, 0.047), (56, 0.011), (57, 0.022), (67, 0.028), (68, 0.014), (74, 0.115), (87, 0.022), (92, 0.041), (98, 0.103)]
simIndex simValue paperId paperTitle
1 0.900729 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Author: Samy Bengio
Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1
same-paper 2 0.86598533 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
3 0.67032933 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
4 0.66808081 41 nips-2002-Bayesian Monte Carlo
Author: Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.
5 0.66069472 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
6 0.65970844 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
7 0.659266 3 nips-2002-A Convergent Form of Approximate Policy Iteration
8 0.65910614 2 nips-2002-A Bilinear Model for Sparse Coding
9 0.65807933 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
10 0.65666103 135 nips-2002-Learning with Multiple Labels
11 0.65521771 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
12 0.65414995 169 nips-2002-Real-Time Particle Filters
13 0.65363216 74 nips-2002-Dynamic Structure Super-Resolution
14 0.65258223 10 nips-2002-A Model for Learning Variance Components of Natural Images
15 0.6525557 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
16 0.64983112 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
17 0.64888442 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
18 0.64854544 124 nips-2002-Learning Graphical Models with Mercer Kernels
19 0.64779919 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
20 0.64687151 53 nips-2002-Clustering with the Fisher Score