jmlr jmlr2007 jmlr2007-32 knowledge-graph by maker-knowledge-mining

32 jmlr-2007-Euclidean Embedding of Co-occurrence Data


Source: pdf

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. [sent-11, score-0.666]

2 The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. [sent-12, score-0.724]

3 The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. [sent-13, score-0.519]

4 Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming. [sent-15, score-0.579]

5 Most current embedding techniques build low dimensional mappings that preserve certain relationships among objects. [sent-23, score-0.587]

6 The methods differ in the relationships they choose to preserve, which range from pairwise distances in multidimensional scaling (MDS) (Cox and Cox, 1984) to neighborhood structure in locally linear embedding (Roweis and Saul, 2000) and geodesic structure in IsoMap (Tenenbaum et al. [sent-24, score-0.666]

7 However, embedding should not be confined to objects of a single type. [sent-27, score-0.642]

8 A joint embedding of different object types could therefore be useful when instances are mapped based on their semantic similarity. [sent-30, score-0.734]

9 Once a joint embedding is achieved, it also naturally defines a measure of similarity between objects of the same type. [sent-31, score-0.775]

10 For instance, joint embedding of images and words induces a distance measure between images that captures their semantic similarity. [sent-32, score-0.779]

11 A key difficulty in constructing joint embeddings of heterogeneous objects is to obtain a good similarity measure. [sent-40, score-0.487]

12 First, one can simply regard co-occurrence rates as approximating pairwise distances, since rates are non-negative and can be used as input to standard metric-based embedding algorithms. [sent-48, score-0.519]

13 We next show how CODE can be extended to jointly embed more than two objects, as demonstrated by jointly embedding words, documents, and authors into a single map. [sent-62, score-0.588]

14 We also obtain quantitative measures of performance by testing the degree to which the embedding captures ground-truth structures in the data. [sent-63, score-0.552]

15 We use these measures to compare CODE to other embedding algorithms, and find that it consistently and significantly outperforms other methods. [sent-64, score-0.519]

16 In this paper, we consider models of the unknown distribution that rely on a joint embedding of the two variables. [sent-71, score-0.625]

17 Formally, this embedding is specified by two functions φ : X → Rq and ψ : Y → Rq that map both categorical variables into the common low dimensional space Rq , as illustrated in Figure 1. [sent-72, score-0.587]

18 The goal of a joint embedding is to find a geometry that reflects well the statistical relationship between the variables. [sent-73, score-0.598]

19 Thus, our models relate the probability p(x, y) of a pair (x, y) to the embedding locations φ(x) and ψ(y). [sent-75, score-0.546]

20 In this work, we focus on the special case in which the model distribution depends on the squared Euclidean distance d²_{x,y} between the embedding points φ(x) and ψ(y): d²_{x,y} = ‖φ(x) − ψ(y)‖² = ∑_{k=1}^{q} (φ_k(x) − ψ_k(y))². [sent-76, score-0.519]
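
A minimal sketch of this model form, assuming embedding matrices phi (|X| × q) and psi (|Y| × q) as plain NumPy arrays (all names here are illustrative, not the authors' code):

```python
import numpy as np

def squared_dists(phi, psi):
    # phi: (|X|, q) embedding of the X values; psi: (|Y|, q) embedding of the Y values
    # d2[x, y] = ||phi[x] - psi[y]||^2, via the expansion ||a||^2 + ||b||^2 - 2 a.b
    return (phi ** 2).sum(1)[:, None] + (psi ** 2).sum(1)[None, :] - 2 * phi @ psi.T

def symmetric_model(phi, psi):
    # p(x, y) proportional to exp(-d2[x, y]), normalized over all (x, y) pairs
    w = np.exp(-squared_dists(phi, psi))
    return w / w.sum()
```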

21 However, a major complication of embedding models is that the embedding locations φ(x) and ψ(y) should be insensitive to the marginals p(x) = ∑y p(x, y) and p(y) = ∑x p(x, y). [sent-79, score-1.108]

22 Such an embedding would reflect the marginal of x rather than its statistical relationship with all the other y values. [sent-92, score-0.547]

23 1 Symmetric Interaction Models The goal of joint embedding is to have the geometry in the embedded space reflect the statistical relationships between variables, rather than just their joint probability. [sent-98, score-0.728]

24 The distribution pMM(x, y) satisfies r_{pMM}(x, y) ∝ e^{−d²_{x,y}}, providing a direct relation between statistical dependencies and embedding distances. [sent-116, score-0.519]

25 When the X or Y objects have additional structure, it may be possible to infer embeddings of unobserved values. [sent-119, score-0.333]

26 In this case, we can use the embedding distances d²_{x,y} to model the conditional word generation process rather than the joint distribution of authors and words. [sent-127, score-0.925]

27 The effect of Equation 9 can then be described informally as that of choosing φ(x) so that the expected value of ψ under pCM (y|x) is the same as its empirical average, that is, placing the embedding of x closer to the embeddings of those y values that have stronger statistical dependence with x. [sent-201, score-0.729]

28 To find the local maximum of the log-likelihood with respect to both φ(x) and ψ(y) for a given embedding dimension q, we use a conjugate gradient ascent algorithm with random restarts. [sent-203, score-0.573]
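
The sketch below illustrates the gradient structure of the log-likelihood for the plain symmetric model p(x, y) ∝ exp(−d²_{x,y}), using simple gradient ascent rather than the conjugate gradient routine of the paper; function names, step size, and initialization are assumptions for illustration only:

```python
import numpy as np

def code_gradients(phi, psi, p_hat):
    # gradient of L = sum_xy p_hat(x,y) * log p_mod(x,y), where
    # p_mod(x,y) is proportional to exp(-||phi[x] - psi[y]||^2)
    d2 = (phi ** 2).sum(1)[:, None] + (psi ** 2).sum(1)[None, :] - 2 * phi @ psi.T
    w = np.exp(-d2)
    p_mod = w / w.sum()
    diff = p_mod - p_hat                      # shape (|X|, |Y|)
    # dL/dphi[x] = 2 * sum_y diff[x, y] * (phi[x] - psi[y]); symmetrically for psi
    g_phi = 2 * (diff.sum(1)[:, None] * phi - diff @ psi)
    g_psi = 2 * (diff.sum(0)[:, None] * psi - diff.T @ phi)
    return g_phi, g_psi

def fit_code(p_hat, q=2, steps=2000, lr=0.1, seed=0):
    # plain gradient ascent from a random start; restarts would keep the best run
    rng = np.random.default_rng(seed)
    phi = rng.normal(scale=0.1, size=(p_hat.shape[0], q))
    psi = rng.normal(scale=0.1, size=(p_hat.shape[1], q))
    for _ in range(steps):
        g_phi, g_psi = code_gradients(phi, psi, p_hat)
        phi, psi = phi + lr * g_phi, psi + lr * g_psi
    return phi, psi
```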

29 Their method also minimizes an averaged distance measure, but normalizes it by the variance of the embedding to avoid trivial solutions. [sent-235, score-0.519]

30 Distance-Based Embeddings: Multidimensional scaling (MDS) is a well-known geometric embedding method (Cox and Cox, 1984), whose standard version applies to same-type objects with predefined distances. [sent-237, score-0.642]

31 MDS embedding of heterogeneous entities was studied in the context of modeling ranking data (Cox and Cox, 1984, Section 7. [sent-238, score-0.57]

32 Our approach differs from theirs in that we treat the joint embedding of two different spaces. [sent-244, score-0.598]

33 Their embedding thus illustrates which X values are close to which classes, and how the different classes are inter-related. [sent-250, score-0.519]

34 An interesting extension of locally linear embedding (Roweis and Saul, 2000) to heterogeneous embedding was presented by Ham et al. [sent-252, score-1.089]

35 Their method essentially forces the outputs of two locally linear embeddings to be aligned such that corresponding pairs of objects are mapped to similar points. [sent-254, score-0.368]

36 A Bayesian network approach to joint embedding was recently studied in Mei and Shelton (2006) in the context of collaborative filtering. [sent-255, score-0.598]

37 In fact, the SDR model is invariant to the translation of either of the embedding maps (for instance, φ(x)), while fixing the other map ψ(y). [sent-280, score-0.519]

38 For a sufficiently high embedding dimension this approximation is in fact exact, as we show next. [sent-285, score-0.546]

39 However, the Euclidean distances used in the embedding space also allow us to reformulate the problem as constrained convex optimization over the cone of positive semidefinite (PSD) matrices (Boyd and Vandenberghe, 2004). [sent-289, score-0.67]

40 Conversely, any PSD matrix of rank ≤ q can be factorized as AᵀA, where A is some embedding matrix of dimension q. [sent-300, score-0.643]

41 Note also that the squared distance between two columns in A is linearly related to the Gram matrix via d²_{xy} = g_{xx} + g_{yy} − 2g_{xy}, and thus the embedding distances are linear functions of the elements of G. [sent-302, score-0.695]
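
The two facts above (G = AᵀA and d²_{xy} = g_{xx} + g_{yy} − 2g_{xy}) can be sketched directly; the column ordering (X points first, then Y points) is my own assumption:

```python
import numpy as np

def dists_from_gram(G, nx, ny):
    # columns 0..nx-1 of A hold the X points, columns nx..nx+ny-1 the Y points;
    # d2[x, y] = g_xx + g_yy - 2 g_xy is linear in the entries of G
    diag = np.diag(G)
    return diag[:nx, None] + diag[None, nx:nx + ny] - 2 * G[:nx, nx:nx + ny]

def factorize_gram(G, q):
    # any PSD G of rank <= q can be written as A^T A with A having q rows:
    # take the top-q eigenpairs and scale the eigenvectors by sqrt(eigenvalue)
    vals, vecs = np.linalg.eigh((G + G.T) / 2)
    idx = np.argsort(vals)[::-1][:q]
    return np.sqrt(np.clip(vals[idx], 0, None))[:, None] * vecs[:, idx].T
```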

42 When the embedding dimension is q = |X| + |Y| the rank constraint is always satisfied and the problem reduces to min_G f(G) subject to G being positive semidefinite (Equation 11). [sent-309, score-0.581]

43 We conclude that when the embedding dimension is of size q = |X| + |Y | the optimization problem of Equation 11 is convex, and thus has no local minima. [sent-316, score-0.546]

44 PSD-CODE Input: A set of regularization parameters {λ_i}_{i=1}^n, an embedding dimension q, and an empirical distribution p(x, y). [sent-366, score-0.546]

45 Output: A q-dimensional embedding of X and Y. Algorithm: for each value of λ_i, use the projected gradient algorithm to solve the optimization problem in Equation 12 with regularization parameter λ_i. [sent-367, score-0.592]

46 Figure 3: The PSD-CODE algorithm for finding a low dimensional embedding using PSD optimization. [sent-373, score-0.587]
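
The projection step that such a projected gradient method needs is the standard eigenvalue clipping onto the PSD cone. A generic sketch, leaving the regularized objective of Equation 12 abstract as a user-supplied gradient function:

```python
import numpy as np

def project_psd(G):
    # project a symmetric matrix onto the PSD cone by clipping negative eigenvalues
    S = (G + G.T) / 2
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0, None)) @ vecs.T

def projected_gradient(grad_f, G0, lr=0.01, steps=500):
    # generic projected gradient descent: step on f, then project back onto G >= 0
    G = project_psd(G0)
    for _ in range(steps):
        G = project_psd(G - lr * grad_f(G))
    return G
```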

47 One approach in this case could be to use joint co-occurrence statistics p(x, y, z) for all three object types, and construct a geometric-probabilistic model for the distribution p(x, y, z) using three embeddings φ(x), ψ(y) and ξ(z) (see Section 9 for further discussion of this approach). [sent-406, score-0.321]

48 where each X^(i) will have an embedding φ^(i)(x^(i)), but all models will share the same ψ(y) embedding. [sent-420, score-0.546]

49 Applications: We demonstrate the performance of co-occurrence embedding on two real-world types of data. [sent-435, score-0.545]

50 To provide quantitative assessment of the performance of our method, we apply it to embed the document-word 20 Usenet newsgroups data set, and we use the embedding to predict the class (newsgroup) for each document, which was not available when creating the embedding. [sent-440, score-0.652]

51 A common approach in these applications is to obtain some measure of similarity between objects of the same type such as words, and approximate it with distances in the embedding space. [sent-448, score-0.785]

52 Figure 6 shows the joint embedding of documents and words. [sent-458, score-0.734]

53 It can be seen that words indeed characterize the topics of their neighboring documents, so that the joint embedding reflects the underlying structure of the data. [sent-459, score-0.684]

54 Here we demonstrate that extension in embedding three object types from the NIPS database: words, authors, and documents. [sent-468, score-0.577]

55 However, we may also consider a joint embedding for the objects (author, word, doc), since there is a common semantic space underlying all three. [sent-471, score-0.764]

56 We use the two models pCM (author, word) and pCM (doc, word), that is, two conditional models where the word variable is conditioned on the doc or on the author variables. [sent-474, score-0.611]

57 Recall that the embedding of the words is assumed to be the same in both models. [sent-475, score-0.579]

58 We seek an embedding of all three objects that maximizes the weighted sum of the log-likelihood of these two models. [sent-476, score-0.642]

59 This corresponds to a particular choice of the weights w_i (here, of how the log-likelihood of pCM(author, word) is weighted). Figure 8 shows three insets of an embedding that uses the above weighting scheme. [sent-488, score-0.551]

60 To evaluate this sensitivity, we introduce a quantitative measure of embedding quality: the authorship measure. [sent-493, score-0.61]

61 We evaluate the above authorship measure for different values of w i to study the sensitivity of the embedding quality to changing the weights. [sent-499, score-0.577]

62 The overall document embedding was similar to Figure 5 and is not shown here. [sent-501, score-0.603]

63 The resulting matrix is a joint distribution over the document and word variables, and is denoted by p(doc, word). [sent-517, score-0.367]
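
As a trivial illustration (names are made up), a raw document-by-word count matrix becomes such a joint distribution by a single normalization, and the conditional model's input p(word | doc) follows by row normalization:

```python
import numpy as np

def joint_distribution(counts):
    # normalize a document-by-word count matrix so all entries sum to one,
    # giving an empirical joint distribution p(doc, word)
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def conditional_given_doc(p_joint):
    # p(word | doc): divide each row by the document marginal p(doc)
    return p_joint / p_joint.sum(axis=1, keepdims=True)
```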

64 Methods Compared: Several methods were compared with respect to both homogeneous and heterogeneous embeddings of words and documents. [sent-519, score-0.321]

65 Then the document embedding was taken to be U√S. [sent-553, score-0.603]

66 An embedding for words can be obtained in a similar manner, but was not used in the current evaluation. [sent-555, score-0.579]

67 MDS searches for an embedding of objects in a low dimensional space, based on a predefined set of pairwise distances (Cox and Cox, 1984). [sent-557, score-0.799]

68 One heuristic approach that is sometimes used for embedding co-occurrence data using standard MDS is to calculate distances between row vectors of the co-occurrence matrix, which is given by p(doc, word) here. [sent-558, score-0.608]
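
A sketch of that heuristic, assuming classical MDS on Euclidean distances between the rows of p(doc, word) (the experiments may use a different MDS variant):

```python
import numpy as np

def classical_mds(D2, q=2):
    # classical MDS: double-center the squared-distance matrix, take top eigenpairs
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:q]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

def mds_on_rows(P, q=2):
    # embed the row objects (documents) from pairwise distances between rows of P
    sq = (P ** 2).sum(1)
    D2 = np.clip(sq[:, None] + sq[None, :] - 2 * P @ P.T, 0, None)
    return classical_mds(D2, q)
```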

69 This results in an embedding of the row objects (documents). [sent-559, score-0.642]

70 Column objects (words) can be embedded similarly, but there is no straightforward way of embedding both simultaneously. [sent-560, score-0.693]

71 Isomap first creates a graph by connecting each object to m of its neighbors, and then uses distances of paths in the graph for embedding using MDS. [sent-564, score-0.64]
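
A matching sketch of the Isomap pipeline on the same row vectors: build an m-nearest-neighbor graph, replace distances by shortest-path (geodesic) distances, and feed the result to the classical_mds sketch above (Floyd-Warshall and a connected graph are assumed for brevity):

```python
import numpy as np

def isomap_geodesic_sq_dists(X, m=5):
    # Euclidean distances between the rows of X
    sq = (X ** 2).sum(1)
    D = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None))
    n = X.shape[0]
    W = np.full((n, n), np.inf)
    np.fill_diagonal(W, 0.0)
    nbrs = np.argsort(D, axis=1)[:, 1:m + 1]        # m nearest neighbors of each point
    for i in range(n):
        W[i, nbrs[i]] = D[i, nbrs[i]]
        W[nbrs[i], i] = D[i, nbrs[i]]               # keep the graph symmetric
    for k in range(n):                              # all-pairs shortest paths
        W = np.minimum(W, W[:, k:k + 1] + W[k:k + 1, :])
    return W ** 2                                   # squared geodesic distances for MDS
```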

72 Of the above methods, only CA and CODE were used for joint embedding of words and documents. [sent-567, score-0.658]

73 The other methods are not designed for joint embedding and were only used for embedding documents alone. [sent-568, score-1.253]

74 Figure 9: Evaluation of the authors-words-documents embedding for different likelihood weights. [sent-579, score-0.542]

75 A weight of 0.5 results in equal weighting of the models after normalizing for size, and corresponds to the embedding shown in Figure 8. [sent-582, score-0.546]

76 Quality Measures for Homogeneous and Heterogeneous Embeddings: Quantitative evaluation of embedding algorithms is not straightforward, since a ground-truth embedding is usually not well defined. [sent-587, score-1.038]

77 For the homogeneous embedding of the document objects, we define a measure denoted by doc-doc, which is designed to measure how well documents with identical labels are mapped together. [sent-589, score-0.834]

78 The measure will have the value one for perfect embeddings where same topic documents are always closer than different topic documents. [sent-592, score-0.376]
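
One hypothetical instantiation in this spirit is a neighborhood-purity score; the paper's exact doc-doc definition is not reproduced here, but the idea that same-topic documents should dominate each document's neighborhood can be computed as follows:

```python
import numpy as np

def neighborhood_purity(Z, labels, k=10):
    # Z: (n, q) embedded documents; labels: (n,) newsgroup labels
    # score = average fraction of each document's k nearest neighbors sharing its label
    labels = np.asarray(labels)
    sq = (Z ** 2).sum(1)
    D = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    np.fill_diagonal(D, np.inf)                 # exclude the document itself
    nbrs = np.argsort(D, axis=1)[:, :k]
    return (labels[nbrs] == labels[:, None]).mean()
```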

79 For the heterogeneous embedding of documents and words into a joint map, we defined a measure denoted by doc-word. [sent-594, score-0.875]

80 Results: Figure 10 (top) illustrates the joint embedding obtained for the CODE model pCM(doc, word) when embedding documents from three different newsgroups. [sent-605, score-1.253]

81 To obtain a quantitative estimate of homogeneous document embedding, we evaluated the doc-doc measure for different embedding methods. [sent-608, score-0.666]

82 Figure 11 shows the dependence of this measure on neighborhood size and embedding dimensionality, for the different methods. [sent-609, score-0.578]

83 Figure 10: Visualization of two-dimensional embeddings of the 20 newsgroups data under two different models. [sent-652, score-0.321]

84 Top: The embedding of documents and words using the conditional word-given-document model pCM (doc, word). [sent-657, score-0.746]

85 Embedding dimension is q = 2. The performance of the heterogeneous embedding of words and documents was evaluated using the doc-word measure for the CA and CODE algorithms. [sent-682, score-0.823]

86 Figure 10 shows an embedding for the conditional model pCM in Equation 3 and for the symmetric model pMM . [sent-690, score-0.55]

87 It can be seen that both models achieve a good embedding of both the relation between documents (different classes mapped to different regions) and the document-word relation (words mapped near documents with relevant subjects). [sent-691, score-0.888]

88 Figure 14: Comparison of different embedding models (UU, MM, CU, UC, CM, MC). [sent-754, score-0.519]

89 Discussion: We presented a method for embedding objects of different types into the same low-dimensional Euclidean space. [sent-761, score-0.717]

90 This embedding can be used to reveal low dimensional structures when distance measures between objects are unknown or cannot be defined. [sent-762, score-0.71]

91 Furthermore, once the embedding is performed, it induces a meaningful metric between objects of the same type. [sent-763, score-0.642]

92 Such an approach may be used, for example, for embedding images based on accompanying text, and then deriving a semantic distance between images. [sent-764, score-0.586]

93 We showed that co-occurrence embedding relates statistical correlation to the local geometric structure of one object type with respect to the other. [sent-765, score-0.576]

94 An important question in embedding objects is whether the embedding is unique, that is, whether there can be two different optimal configurations of points. [sent-771, score-1.161]

95 Co-occurrence embedding does not have to be restricted to distributions over pairs of variables, but can be extended to multivariate joint distributions. [sent-776, score-0.598]

96 An interesting problem in many embedding algorithms is generalization to new values. [sent-781, score-0.519]

97 Here this would correspond to obtaining embeddings for values of X or Y such that p(x) = 0 or p(y) = 0, for instance because a word did not appear in the sample documents. [sent-782, score-0.383]

98 Such an approach has been previously used to extend embedding methods such as LLE to unseen points (Zhang, 2007). [sent-788, score-0.519]

99 These extensions and the results presented in this paper suggest that probability-based continuous embeddings of categorical objects could be applied efficiently and provide accurate models for complex high dimensional data. [sent-791, score-0.406]

100 A Short Review of Correspondence Analysis: Correspondence analysis (CA) is an exploratory data analysis method that embeds two variables X and Y into a low dimensional space such that the embedding reflects their statistical dependence (Greenacre, 1984). [sent-793, score-0.587]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('embedding', 0.519), ('pcm', 0.385), ('doc', 0.306), ('embeddings', 0.21), ('word', 0.173), ('code', 0.163), ('psd', 0.159), ('ereira', 0.141), ('hechik', 0.141), ('ishby', 0.141), ('loberson', 0.141), ('mbedding', 0.141), ('documents', 0.136), ('objects', 0.123), ('uclidean', 0.119), ('distances', 0.089), ('pmm', 0.085), ('puu', 0.085), ('document', 0.084), ('joint', 0.079), ('newsgroup', 0.071), ('mds', 0.071), ('newsgroups', 0.065), ('words', 0.06), ('cox', 0.059), ('isomap', 0.057), ('pcu', 0.056), ('occurrence', 0.054), ('svd', 0.052), ('embedded', 0.051), ('heterogeneous', 0.051), ('rq', 0.05), ('gt', 0.048), ('chechik', 0.048), ('globerson', 0.048), ('cca', 0.048), ('author', 0.047), ('neuroscience', 0.047), ('sdr', 0.047), ('dimensional', 0.046), ('euclidean', 0.046), ('marginals', 0.043), ('semantic', 0.043), ('semide', 0.04), ('convex', 0.037), ('mapped', 0.035), ('embed', 0.035), ('rank', 0.035), ('roweis', 0.034), ('authors', 0.034), ('quantitative', 0.033), ('correspondence', 0.033), ('mm', 0.032), ('insets', 0.032), ('object', 0.032), ('conditional', 0.031), ('matrix', 0.031), ('log', 0.031), ('walk', 0.03), ('aa', 0.03), ('ns', 0.03), ('measure', 0.03), ('neighborhood', 0.029), ('multidimensional', 0.029), ('database', 0.029), ('gal', 0.028), ('tishby', 0.028), ('authorship', 0.028), ('gxx', 0.028), ('gyy', 0.028), ('iwata', 0.028), ('ca', 0.028), ('marginal', 0.028), ('models', 0.027), ('dimension', 0.027), ('eigenvalues', 0.027), ('lt', 0.027), ('gradient', 0.027), ('types', 0.026), ('ects', 0.026), ('topics', 0.026), ('vb', 0.026), ('obermayer', 0.026), ('architectures', 0.026), ('correlation', 0.025), ('factorization', 0.025), ('vandenberghe', 0.025), ('matrices', 0.025), ('images', 0.024), ('saul', 0.024), ('border', 0.024), ('boyd', 0.024), ('weinberger', 0.024), ('grad', 0.024), ('greenacre', 0.024), ('shawe', 0.024), ('similarity', 0.024), ('likelihood', 0.023), ('becker', 0.023), ('low', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

2 0.085616998 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

3 0.074310362 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

4 0.064882211 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Author: Masashi Sugiyama

Abstract: Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called localitypreserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix

5 0.063559793 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vsthe-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure clas∗. The first two authors contributed equally to this work. c 2007 Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble and Christina Leslie. M ELVIN , I E , W ESTON , N OBLE AND L ESLIE sification problem that we consider. Supplementary data and source code are available at http: //www.cs

6 0.057288095 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

7 0.055433065 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

8 0.039629292 82 jmlr-2007-The Need for Open Source Software in Machine Learning

9 0.035773043 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics

10 0.035361312 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors

11 0.034879308 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

12 0.034157425 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

13 0.03394457 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

14 0.033032537 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

15 0.032487333 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

16 0.032354467 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

17 0.031708296 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

18 0.031333286 72 jmlr-2007-Relational Dependency Networks

19 0.031283621 23 jmlr-2007-Concave Learners for Rankboost

20 0.029401915 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, 0.085), (2, 0.029), (3, 0.09), (4, -0.167), (5, 0.017), (6, -0.068), (7, -0.05), (8, 0.092), (9, -0.053), (10, -0.057), (11, 0.028), (12, -0.269), (13, 0.228), (14, 0.038), (15, -0.114), (16, 0.134), (17, 0.073), (18, 0.087), (19, -0.093), (20, 0.064), (21, -0.007), (22, -0.007), (23, -0.071), (24, -0.064), (25, -0.149), (26, 0.013), (27, -0.08), (28, -0.132), (29, 0.192), (30, -0.057), (31, 0.008), (32, -0.069), (33, -0.129), (34, 0.047), (35, 0.155), (36, -0.051), (37, 0.069), (38, -0.008), (39, 0.053), (40, 0.079), (41, -0.181), (42, -0.091), (43, -0.016), (44, -0.053), (45, -0.038), (46, -0.026), (47, 0.191), (48, 0.163), (49, -0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96326315 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

2 0.50918102 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Author: Masashi Sugiyama

Abstract: Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called localitypreserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix

3 0.43558535 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

4 0.39079589 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

5 0.37090364 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vsthe-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure clas∗. The first two authors contributed equally to this work. c 2007 Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble and Christina Leslie. M ELVIN , I E , W ESTON , N OBLE AND L ESLIE sification problem that we consider. Supplementary data and source code are available at http: //www.cs

6 0.29282448 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

7 0.24355207 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

8 0.23357178 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

9 0.2259526 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

10 0.20556134 15 jmlr-2007-Bilinear Discriminant Component Analysis

11 0.19440852 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

12 0.18838108 82 jmlr-2007-The Need for Open Source Software in Machine Learning

13 0.18122385 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

14 0.17847654 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

15 0.17438066 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

16 0.16526508 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics

17 0.16162297 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

18 0.16096835 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

19 0.15962712 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors

20 0.15655141 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.342), (1, 0.011), (4, 0.024), (8, 0.032), (10, 0.023), (12, 0.029), (15, 0.055), (22, 0.014), (28, 0.05), (40, 0.046), (45, 0.024), (48, 0.061), (60, 0.029), (80, 0.025), (85, 0.066), (98, 0.086)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7041766 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

2 0.38977304 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

3 0.38495678 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

4 0.38209832 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Author: Masashi Sugiyama

Abstract: Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called localitypreserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix

5 0.38201356 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage

6 0.37927359 39 jmlr-2007-Handling Missing Values when Applying Classification Models

7 0.37783799 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

8 0.37516344 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

9 0.37207752 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

10 0.36939758 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

11 0.3691681 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

12 0.36711368 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

13 0.36660028 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

14 0.36625677 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

15 0.36543861 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

16 0.36352861 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

17 0.36343163 72 jmlr-2007-Relational Dependency Networks

18 0.36287165 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

19 0.36260444 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

20 0.36248276 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression