nips nips2002 nips2002-115 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. [sent-3, score-0.113]
2 1 Introduction Many practical problems involve modeling large, high-dimensional data sets to uncover similarities or latent structure. [sent-8, score-0.186]
3 Linear low rank approximation techniques such as PCA [12], LSA [5], PLSA [6] and generative aspect models [1] are powerful tools for approaching these tasks. [sent-9, score-0.113]
4 In doing so, they exploit and expose regularities in the data: the hyperplanes represent a latent space whose dimensions are often observed to correspond to distinct latent categories in the data set. [sent-11, score-0.314]
5 For example, an LSA-derived low-rank approximation to a corpus of news stories may have dimensions corresponding to “politics,” “finance,” “sports,” etc. [sent-12, score-0.09]
6 Documents with the same inferred sources (therefore “about” the same topic) generally lie close to each other in the latent space. [sent-13, score-0.148]
7 [10] studied the problem of learning to classify data into pre-existing categories in the presence of labeled and unlabeled examples. [sent-17, score-0.099]
8 In contrast, this paper considers a method for augmenting a traditional unsupervised learning problem with the addition of equivalence classes. [sent-19, score-0.142]
9 We frequently have some reason for believing that a set of observations are similar in some sense without wanting to or being able to say why they are similar. [sent-21, score-0.09]
10 Note that the sets are not required to be comprehensive — we may only have known associations between a handful of observations. [sent-22, score-0.179]
11 Further, the sets are not required to be disjoint; we may know that members of a set are similar, but there is no implication that members of two different sets are dissimilar. [sent-23, score-0.318]
12 In any case, the hope is that by indicating which observations are similar, we can bias our model to focus on relevant features and to ignore differences that, while statistically significant, are not correlated with our idea of similarity in the problem at hand. [sent-24, score-0.127]
13 1.1 Related work There is too large a literature examining the combination of supervised and unsupervised learning to cover here; below I mention in passing some of the most relevant research. [sent-27, score-0.065]
14 In terms of conceptual similarity, multiple discriminant analysis (MDA) and oriented principal components analysis (OPCA) are techniques that attempt to maximize the fidelity of a linear low rank approximation while minimizing the variance of data belonging to designated equivalence classes [2]. [sent-28, score-0.329]
15 The difference with the approach discussed here is that MDA and OPCA maximize a ratio of variances rather than a mixture; this is equivalent to making the assumption that the covariance matrices for each set are tied. [sent-29, score-0.082]
16 In terms of implementation, the present algorithm owes a great deal to the “shadow targets” algorithm for Neuroscale [8, 15], whose eponymous data points enforce equivalence classes on sets of (otherwise) unsupervised data. [sent-32, score-0.214]
17 That algorithm trades fidelity of representation against fidelity of equivalence classes much in the same way as Equation 4, although it does so in the context of a Kohonen neural network instead of a linear mapping. [sent-33, score-0.114]
18 Another closely-related technique is CI-LSI [7], which uses latent semantic analysis for cross-language retrieval. [sent-34, score-0.228]
19 The technique involves training on text documents from a parallel corpus for two or more languages (e.g. [sent-35, score-0.32]
20 French and English), such that each document exists as both an English and French version. [sent-37, score-0.139]
21 In CI-LSI, each document is merged with its twin, and the hyperplane is fit to the set of paired documents. [sent-38, score-0.246]
22 The goal of CI-LSI matches the goal of this paper, and the technique can in fact be seen as a special case of the informed projections discussed here. [sent-39, score-0.59]
23 By using the “mean” of a pair of documents as a proxy for the documents themselves, we assert that the two come from a common source; fitting a model to a collection of such means finds a maximum likelihood solution subject to the constraint that both members of a pair come from a common source. [sent-40, score-0.562]
24 2 Informed and uninformed projections To introduce informed projections, I will first briefly review principal components analysis (PCA) and an algorithm for efficiently computing the principal components of a data set. [sent-41, score-0.713]
25 2.1 PCA and EMPCA Given a finite data set X ⊂ R^n, where each column corresponds to one observation, PCA can be used to find a rank-m approximation X̂ (where m < n) which minimizes the sum squared error of the approximation. [sent-43, score-0.113]
26 Figure 1: PCA maximizes the variance of the observations (on left), while an informed projection minimizes variance of projections from observations belonging to the same set. [sent-67, score-0.969]
27 X can then be projected onto the hyperplane defined by C as X̂ = C(C^T C)^{-1} C^T X. [sent-73, score-0.203]
(1) Although not strictly a generative model, PCA offers a probabilistic interpretation: C represents a maximum likelihood model of the data under the assumption that X consists of (Gaussian) noise-corrupted observations taken from linear combinations of m sources in an n-dimensional space. [sent-74, score-0.166]
29 Beginning with an arbitrary guess for C, the latent representation of X is computed as Y = (C^T C)^{-1} C^T X (2), after which C is updated to maximize the estimated likelihoods: C = XY^T (YY^T)^{-1}. [sent-77, score-0.186]
30 (3) Equations 2 and 3 are iterated until convergence (typically less than 10 iterations), at which time the sum squared error of X̂’s approximation to X will have been minimized. [sent-78, score-0.082]
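The EM updates quoted above (Equations 1 to 3) can be sketched directly in code. This is a minimal numpy illustration, assuming a data matrix X whose columns are (centered) observations; the function name and defaults are mine, not from the paper.

```python
import numpy as np

def empca(X, m, n_iter=10, seed=0):
    """EM algorithm for PCA (after Roweis): X is n x N with observations as
    columns, m < n is the target rank. Returns the basis C, the latent
    coordinates Y, and the rank-m approximation X_hat."""
    n, N = X.shape
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((n, m))                   # arbitrary initial guess for C
    for _ in range(n_iter):                           # typically < 10 iterations suffice
        Y = np.linalg.solve(C.T @ C, C.T @ X)         # Eq. 2: Y = (C^T C)^{-1} C^T X
        C = X @ Y.T @ np.linalg.inv(Y @ Y.T)          # Eq. 3: C = X Y^T (Y Y^T)^{-1}
    X_hat = C @ np.linalg.solve(C.T @ C, C.T @ X)     # Eq. 1: projection onto span(C)
    return C, Y, X_hat
```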
31 2.2 Informed projections PCA only penalizes according to the squared distance of an observation x_i from its projection x̂_i. [sent-80, score-0.5]
32 Given a Gaussian noise model, x̂_i is the maximum likelihood estimate of x_i’s “source,” which is the only constraint with which PCA is concerned. [sent-81, score-0.124]
33 If we believe that a set of observations Si = {x1 , x2 , . [sent-82, score-0.09]
34 For a hyperplane defined by eigenvectors C, the maximum likelihood source is the mean of S_i’s projections onto C, denoted S̄_i. [sent-86, score-0.437]
35 As such, the likelihood should be penalized not only on the basis of the variance of observations around their projections, ∑_j ||x_j − x̂_j||², but also the variance of the projections around their set means, ∑_i ∑_{x_j∈S_i} ||x̂_j − S̄_i||². [sent-87, score-0.676]
36 With β = 0.5, Equation 4 is equivalent to minimizing ∑_i ∑_{x_j∈S_i} ||x_j − S̄_i||² under the assumption that all otherwise unaffiliated x_i are members of their own singleton sets. [sent-90, score-0.128]
37 This is just the squared distance from each observation to its projected cluster mean, which appears to be the criterion CI-LSI minimizes by averaging documents. [sent-91, score-0.156]
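Reading the two penalty terms above as a β-weighted mixture (my reading of Equation 4, which is not reproduced verbatim in this extraction), the informed criterion can be sketched as follows; the function and variable names are illustrative.

```python
import numpy as np

def informed_error(X, C, sets, beta=0.5):
    """(1 - beta) * reconstruction error + beta * within-set variance of the
    projections. X: n x N data (columns are observations); C: n x m basis;
    sets: list of integer index arrays, one per equivalence set S_i."""
    P = C @ np.linalg.solve(C.T @ C, C.T)                   # projector onto span(C)
    X_hat = P @ X
    recon = np.sum((X - X_hat) ** 2)                        # sum_j ||x_j - x_hat_j||^2
    within = 0.0
    for idx in sets:
        S_bar = X_hat[:, idx].mean(axis=1, keepdims=True)   # mean of S_i's projections
        within += np.sum((X_hat[:, idx] - S_bar) ** 2)      # sum ||x_hat_j - S_bar_i||^2
    return (1 - beta) * recon + beta * within
```

Under this reading, singleton sets zero out the second term and recover ordinary PCA; with β = 0.5 and every point in a set, orthogonality of the two residuals gives the squared distance from each observation to its projected cluster mean, consistent with the remarks above.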
38 2.3 Finding an informed projection The error criterion in Equation 4 may be efficiently optimized with an expectation-maximization (EM) procedure based on Roweis’ EMPCA [13], alternately computing estimated sources x̂ and maximizing the likelihoods of the observed data given those sources. [sent-93, score-0.525]
39 The likelihood of a set is maximized by minimizing the variance of projections from members of a set around their mean. [sent-94, score-0.401]
40 This is at odds with the efforts of PCA to maximize likelihood by maximizing the variance of projections from the data set at large. [sent-95, score-0.399]
41 We can make these forces work together by adding a “complement set” S̃_i for each set S_i such that the variance of S_i’s projections is minimized by maximizing the variance of S̃_i’s projections. [sent-96, score-0.374]
42 The complement set may be determined analytically, but can also be computed efficiently as an extra step between the “E” and “M” steps of the EM iteration. [sent-97, score-0.084]
43 Given an observation x_j ∈ S_i, the complement for x_j may be computed in terms of its projection x̂_j onto the hyperplane and S̄_i, the mean of the set. [sent-98, score-0.396]
44 Figure 2: Location of a point’s complement x̃_j with respect to its mean set projection S̄_i and the current hyperplane. [sent-99, score-0.198]
45 In order to “pull” the current hyperplane in the direction that will minimize x_j’s distance from the set mean, x̃_j must be positioned at a distance of ||x_j − x̂_j|| from the hyperplane such that its projection lies along the line from S̄_i to x̂_j, at a distance from S̄_i equal to ||x_j − x̂_j||. [sent-100, score-0.328]
46 For efficiency, it is worth noting that by subtracting each set’s mean from its constituent observations, all sets may be combined into a single zero-mean “superset” S̃ from which complements are computed. [sent-102, score-0.154]
47 Once the complement set has been computed, it can be appended to the original observations to create a joint data set, denoted X^+ = [X|S̃], and the “M” step of the EM procedure is continued as before: Y = (C^T C)^{-1} C^T X^+, C = X^+ Y^T (YY^T)^{-1}. [sent-103, score-0.126]
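A sketch of how the augmented M-step quoted above could look in code, given a matrix S_tilde of complement (“shadow”) points computed between the E and M steps; the geometric construction of the complements themselves (Figure 2) is not reproduced here, and the function name is mine.

```python
import numpy as np

def informed_em_step(X, C, S_tilde):
    """One iteration on the augmented data X+ = [X | S_tilde]:
    Y = (C^T C)^{-1} C^T X+, then C = X+ Y^T (Y Y^T)^{-1}."""
    X_plus = np.hstack([X, S_tilde])                  # append complement set to the data
    Y = np.linalg.solve(C.T @ C, C.T @ X_plus)        # latent coordinates of augmented data
    C_new = X_plus @ Y.T @ np.linalg.inv(Y @ Y.T)     # "M" step, continued as before
    return C_new
```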
48 apply a “torque” to the current hyperplane around the origin. [sent-105, score-0.107]
49 By multiplying all coordinates of an observation by the same scalar, we scale the torque applied by the same amount. [sent-106, score-0.102]
50 The first two were text data sets taken from the WebKB project and the “20 newsgroups” data set. [sent-108, score-0.116]
51 The third data set consisted of acoustic features from recorded music. [sent-109, score-0.146]
52 Finally, I examine the effect of adding set information to the joint probabilistic model described by Cohn and Hofmann [3]. [sent-110, score-0.139]
53 The result was 1000 documents with 1000 features, where feature f_{i,j} represented the frequency with which term j occurred in document x_i. [sent-114, score-0.372]
54 The experiments varied both the fraction of the training data for which set associations were provided (0-1) and the weight given to preserving those sets (also 0-1). [sent-116, score-0.406]
55 For each combination, I ran 40 trials, each using a randomized split of 200 training documents and 100 test documents. [sent-117, score-0.23]
56 Accuracy was evaluated based on leave-one-out nearest neighbor classification over the test set. [sent-118, score-0.176]
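A sketch of this leave-one-out nearest-neighbor evaluation, assuming the test documents have already been projected into the latent space; names and signatures are illustrative, not from the paper.

```python
import numpy as np

def loo_nn_accuracy(Z, labels):
    """Leave-one-out nearest-neighbor accuracy. Z: m x N projected test points
    (columns); labels: length-N array of category labels."""
    labels = np.asarray(labels)
    D = np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0).astype(float)  # pairwise squared distances
    np.fill_diagonal(D, np.inf)                       # a point may not be its own neighbor
    nearest = np.argmin(D, axis=0)                    # index of nearest other point
    return float(np.mean(labels[nearest] == labels))
```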
57 Figure 3: Nearest neighbor classification of WebKB data, where a 5D PCA of document terms has been informed by web page category-determined sets (40 independent train/test splits). [sent-164, score-0.624]
58 The fraction of observations that have been given set assignments is varied from 0 to 1 (left plot), as is β, the weight attached to preserving set associations (right plot). [sent-165, score-0.482]
59 As expected, the more documents that had set associations, the greater the improvement in classification accuracy. Obviously, simple nearest neighbor is far from the most effective classification technique for this domain. [sent-167, score-0.405]
60 But the point of the experiment is to evaluate to what degree informing a projection preserves or improves topic locality, which nearest neighbor classifiers are well-suited to measure. [sent-168, score-0.497]
61 Below 0.3, the sets were not given enough weight to make a difference, while above 0. [sent-172, score-0.158]
62 Beginning with the documents of the 20 newsgroups data set, I again preprocessed the documents as above with Rainbow, but this time kept the entire vocabulary (27214 unique terms), instead of preselecting maximally informative terms. [sent-175, score-0.526]
63 Thirty independent training and test sets of 100 documents each were run for 0 ≤ β ≤ 1, and as before, accuracy was evaluated in terms of leave-one-out classification error on the test set. [sent-177, score-0.437]
64 Figure 4: Five categories from the 20 newsgroups data set, where a 5D PCA of document terms has been informed by source category (30 train/test splits, for 0 < β < 1). [sent-195, score-0.681]
65 The characteristic learning curve is very similar to that for the WebKB data — an intermediate set weighting yields significantly better performance than the purely supervised or unsupervised cases. [sent-197, score-0.104]
66 There is, however, one notable distinction: in these experiments, there is much less variation in accuracy for large values of β — it almost appears that there are three stable regions of performance. [sent-198, score-0.097]
67 3.3 Album recognition from acoustic features The third test used a proprietary data set of acoustic properties of recorded music. [sent-200, score-0.259]
68 The data set contained 11252 recorded music tracks from 939 albums. [sent-201, score-0.233]
69 Each observation consisted of 85 highly-processed acoustic features extracted automatically via digital signal processing. [sent-202, score-0.169]
70 The goal of this experiment was to determine whether informing a projected model could improve the accuracy with which it could identify tracks from the same album. [sent-203, score-0.602]
71 Recalling Platt’s playlist selection problem [11], this can serve as a proxy for estimating how well the model can predict whether two tracks “belong together” by the subjective measure of the artist who created the album. [sent-204, score-0.212]
72 For these experiments, I selected the first 8439 tracks (3/4 of the data) for training, assigning each track to be a member of the set defined by the album it came from. [sent-205, score-0.417]
73 The remaining 2813 tracks were used as test data. [sent-210, score-0.201]
74 The 85-dimensional features were projected down into a 10-dimensional space, informing the projection with sets defined by tracks from the same album. [sent-211, score-0.693]
75 As above, I measured the frequency with which each test track had another track from the same album as its nearest neighbor when projected down into this same space. [sent-213, score-0.535]
76 One reason for the meager improvement may be that the features from which the projections were computed had already been highly processed. [sent-215, score-0.512]
77 Table 1: Album recognition results using 2813 test tracks from 316 albums. [sent-226, score-0.201]
78 For each weighting β, “accuracy” is the fraction of times in which the closest track to a test track came from the same album; “ratio” indicates the average ratio of intra-album distances to inter-album distances in the test set. [sent-227, score-0.406]
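One plausible reading of the “ratio” column, sketched here for concreteness: for each test track, compare its mean distance to same-album tracks against its mean distance to other-album tracks in the projected space, then average over tracks. The exact definition used in the paper may differ; the helper below is mine.

```python
import numpy as np

def intra_inter_ratio(Z, album_ids):
    """Average ratio of intra-album to inter-album distances (lower is better).
    Z: m x N projected tracks (columns); album_ids: length-N album labels."""
    album_ids = np.asarray(album_ids)
    D = np.sqrt(np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0))  # pairwise distances
    ratios = []
    for j in range(Z.shape[1]):
        same = album_ids == album_ids[j]
        same[j] = False                               # leave the track itself out
        other = album_ids != album_ids[j]
        if same.any() and other.any():
            ratios.append(D[same, j].mean() / D[other, j].mean())
    return float(np.mean(ratios))
```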
79 In all cases, informing the projection with a weight of β = 0.5 [sent-228, score-0.228]
80 increases the accuracy and decreases the ratio of the model. [sent-229, score-0.143]
81 Interestingly, OPCA slightly outperforms the informed projection for both criteria on this problem. [sent-231, score-0.455]
82 3.4 Content, context and connections Prior work [3] discussed building joint probabilistic models of a document base, using both the content of the documents and the connections (citations or hyperlinks) between them. [sent-233, score-0.642]
83 A document base frequently contains context as well, in the form of documents from the same source or by the same author. [sent-234, score-0.412]
84 Informed projection provides a way for us to inject this third form of information and further improve our models. [sent-235, score-0.149]
85 Figure 5 summarizes the results of using set information to “inform” the joint content+link models discussed in the previous paper. [sent-236, score-0.094]
86 Instead, we can make use of the observation of Section 1.1 [sent-239, score-0.057]
87 to approximate the informing process by merging documents from the same set. [sent-240, score-0.437]
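A sketch of this approximation: documents belonging to the same set are merged into their mean before fitting the unsupervised model, in the spirit of the CI-LSI averaging mentioned in Section 1.1. The helper below is illustrative and assumes non-empty sets.

```python
import numpy as np

def merge_sets(X, sets):
    """Replace each set of documents (columns of X) with its mean; documents
    not belonging to any set are kept unchanged."""
    in_a_set = np.concatenate(sets)
    rest = np.setdiff1d(np.arange(X.shape[1]), in_a_set)          # unaffiliated documents
    means = np.column_stack([X[:, idx].mean(axis=1) for idx in sets])
    return np.hstack([means, X[:, rest]])
```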
88 Figure 5 illustrates that this process complements the earlier content+connections approach, providing a joint model of document content, context and connections. [sent-241, score-0.263]
89 Figure 5: (left) Classification accuracy of informed vs. [sent-267, score-0.477]
90 uninformed models of separate and joint models of document content and connections, using the WebKB dataset. [sent-268, score-0.395]
91 (right) Effect of adding more document context in the form of set membership information on the Cora data set. [sent-269, score-0.255]
92 4 Discussion and future work The experiments so far indicate that adding set information to a low rank approximation does improve the quality of a model, but only to the extent that the information is used in conjunction with the unsupervised information already present in the data set. [sent-271, score-0.253]
93 The improvement in performance is evident for content models (such as LSA), connection models, and joint models of content and connections. [sent-272, score-0.346]
94 The first is further exploration of the relationship between informed PCA and the variants of MDA discussed in Section 1.1. [sent-275, score-0.376]
95 A second broad area for future work is the application of the techniques described here to richer low rank approximation models. [sent-279, score-0.113]
96 While this paper considered the effect of informing PCA, it would be fruitful to examine both the process and effect of informing multinomial-based models [3, 6], fully-generative models [1] and local linear embeddings [14]. [sent-280, score-0.604]
97 The missing link - a probabilistic model of document content and hypertext connectivity. [sent-301, score-0.273]
98 Using latent semantic analysis to improve access to textual information. [sent-321, score-0.224]
99 Learning to classify text from labeled and unlabeled documents. [sent-357, score-0.093]
100 Shadow targets: A novel algorithm for topographic projections by radial basis functions. [sent-398, score-0.21]
wordName wordTfidf (topN-words)
[('informed', 0.341), ('si', 0.316), ('informing', 0.245), ('projections', 0.21), ('pca', 0.194), ('documents', 0.192), ('tracks', 0.163), ('webkb', 0.16), ('opca', 0.153), ('document', 0.139), ('content', 0.134), ('album', 0.133), ('frac', 0.123), ('latent', 0.114), ('projection', 0.114), ('hyperplane', 0.107), ('associations', 0.107), ('newsgroups', 0.107), ('accuracy', 0.097), ('mda', 0.092), ('lsa', 0.091), ('observations', 0.09), ('members', 0.087), ('weight', 0.086), ('complement', 0.084), ('track', 0.082), ('delity', 0.08), ('uninformed', 0.08), ('equivalence', 0.077), ('semantic', 0.075), ('cohn', 0.075), ('acoustic', 0.075), ('neighbor', 0.072), ('sets', 0.072), ('rank', 0.068), ('nearest', 0.066), ('unsupervised', 0.065), ('nigam', 0.064), ('variance', 0.062), ('projected', 0.062), ('empca', 0.061), ('rainbow', 0.061), ('attached', 0.058), ('platt', 0.058), ('classi', 0.058), ('observation', 0.057), ('effect', 0.057), ('preserving', 0.054), ('ct', 0.054), ('summarizes', 0.052), ('categories', 0.05), ('unlabeled', 0.049), ('proxy', 0.049), ('shadow', 0.049), ('odds', 0.049), ('connections', 0.049), ('em', 0.046), ('ratio', 0.046), ('complements', 0.045), ('torque', 0.045), ('approximation', 0.045), ('corpus', 0.045), ('varied', 0.045), ('text', 0.044), ('roweis', 0.044), ('source', 0.044), ('french', 0.043), ('joint', 0.042), ('fraction', 0.042), ('likelihood', 0.042), ('principal', 0.041), ('xi', 0.041), ('dumais', 0.041), ('mccallum', 0.041), ('adding', 0.04), ('weighting', 0.039), ('membership', 0.039), ('came', 0.039), ('technique', 0.039), ('test', 0.038), ('subtracting', 0.037), ('squared', 0.037), ('context', 0.037), ('cation', 0.037), ('features', 0.037), ('improvement', 0.036), ('music', 0.036), ('yy', 0.036), ('likelihoods', 0.036), ('hyperplanes', 0.036), ('splits', 0.036), ('maximize', 0.036), ('exploration', 0.035), ('english', 0.035), ('vocabulary', 0.035), ('burges', 0.035), ('improve', 0.035), ('recorded', 0.034), ('sources', 0.034), ('onto', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 115 nips-2002-Informed Projections
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets. 1
2 0.19481613 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
3 0.15344816 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
Author: Bin Wu, K. Wong, David Bodoff
Abstract: We study an explicit parametric model of documents, queries, and relevancy assessment for Information Retrieval (IR). Mean-field methods are applied to analyze the model and derive efficient practical algorithms to estimate the parameters in the problem. The hyperparameters are estimated by a fast approximate leave-one-out cross-validation procedure based on the cavity method. The algorithm is further evaluated on several benchmark databases by comparing with standard algorithms in IR.
4 0.13598892 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
5 0.13493915 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
6 0.1152181 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
7 0.1066402 83 nips-2002-Extracting Relevant Structures with Side Information
8 0.10330726 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
9 0.10282881 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
10 0.090876177 135 nips-2002-Learning with Multiple Labels
11 0.089802332 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
12 0.080671415 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
13 0.077416591 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
14 0.074733183 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
15 0.074579731 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
16 0.072701924 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
17 0.070848085 110 nips-2002-Incremental Gaussian Processes
18 0.070795067 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
19 0.070670277 90 nips-2002-Feature Selection in Mixture-Based Clustering
20 0.068793617 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
topicId topicWeight
[(0, -0.234), (1, -0.076), (2, 0.012), (3, 0.007), (4, -0.172), (5, 0.079), (6, -0.112), (7, -0.212), (8, 0.061), (9, -0.079), (10, -0.193), (11, -0.116), (12, 0.002), (13, -0.058), (14, 0.128), (15, 0.079), (16, -0.063), (17, 0.052), (18, 0.033), (19, 0.05), (20, 0.029), (21, 0.079), (22, 0.058), (23, -0.029), (24, 0.028), (25, -0.024), (26, -0.091), (27, 0.035), (28, -0.037), (29, -0.062), (30, -0.019), (31, 0.004), (32, -0.027), (33, 0.043), (34, -0.061), (35, 0.01), (36, -0.034), (37, 0.05), (38, 0.051), (39, 0.052), (40, 0.077), (41, 0.052), (42, 0.014), (43, -0.023), (44, 0.035), (45, 0.055), (46, 0.034), (47, -0.018), (48, 0.055), (49, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.94367713 115 nips-2002-Informed Projections
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets. 1
2 0.79861999 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
Author: Bin Wu, K. Wong, David Bodoff
Abstract: We study an explicit parametric model of documents, queries, and relevancy assessment for Information Retrieval (IR). Mean-field methods are applied to analyze the model and derive efficient practical algorithms to estimate the parameters in the problem. The hyperparameters are estimated by a fast approximate leave-one-out cross-validation procedure based on the cavity method. The algorithm is further evaluated on several benchmark databases by comparing with standard algorithms in IR.
3 0.7778123 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
4 0.74882054 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
Author: Brochu Eric, Nando de Freitas
Abstract: We present a novel, flexible statistical approach for modelling music and text jointly. The approach is based on multi-modal mixture models and maximum a posteriori estimation using EM. The learned models can be used to browse databases with documents containing music and text, to search for music using queries consisting of music and text (lyrics and other contextual information), to annotate text documents with music, and to automatically recommend or identify similar songs.
5 0.65790355 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
6 0.65222597 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
7 0.63459963 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
8 0.58931267 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
9 0.49502268 150 nips-2002-Multiple Cause Vector Quantization
10 0.47970268 190 nips-2002-Stochastic Neighbor Embedding
11 0.46492228 83 nips-2002-Extracting Relevant Structures with Side Information
12 0.45074219 125 nips-2002-Learning Semantic Similarity
13 0.43779856 138 nips-2002-Manifold Parzen Windows
14 0.42301089 111 nips-2002-Independent Components Analysis through Product Density Estimation
15 0.38491631 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
16 0.37099987 96 nips-2002-Generalized² Linear² Models
17 0.36276513 178 nips-2002-Robust Novelty Detection with Single-Class MPM
18 0.35482091 110 nips-2002-Incremental Gaussian Processes
19 0.34968856 36 nips-2002-Automatic Alignment of Local Representations
20 0.34593502 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
topicId topicWeight
[(11, 0.014), (23, 0.011), (42, 0.577), (54, 0.091), (55, 0.02), (67, 0.011), (68, 0.021), (74, 0.08), (92, 0.03), (98, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.95452476 115 nips-2002-Informed Projections
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets. 1
2 0.93964976 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
Author: Herbert Jaeger
Abstract: Echo state networks (ESN) are a novel approach to recurrent neural network training. An ESN consists of a large, fixed, recurrent
3 0.92401326 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
4 0.91170841 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
Author: Christopher Williams, John S. Shawe-taylor
Abstract: In this paper we analyze the relationships between the eigenvalues of the m x m Gram matrix K for a kernel k(·, .) corresponding to a sample Xl, ... ,X m drawn from a density p(x) and the eigenvalues of the corresponding continuous eigenproblem. We bound the differences between the two spectra and provide a performance bound on kernel peA. 1
5 0.8077758 138 nips-2002-Manifold Parzen Windows
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
6 0.64505517 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
7 0.62697083 46 nips-2002-Boosting Density Estimation
8 0.62440997 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
9 0.59913033 169 nips-2002-Real-Time Particle Filters
10 0.59576631 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
11 0.58338022 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.57789463 3 nips-2002-A Convergent Form of Approximate Policy Iteration
13 0.57446611 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
14 0.56500369 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
15 0.56480217 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
16 0.56078809 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
17 0.55370426 96 nips-2002-Generalized² Linear² Models
18 0.55209583 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
19 0.55070382 190 nips-2002-Stochastic Neighbor Embedding
20 0.54904473 69 nips-2002-Discriminative Learning for Label Sequences via Boosting