nips nips2008 nips2008-194 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. [sent-10, score-0.27]
2 The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. [sent-12, score-0.705]
3 During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. [sent-13, score-0.236]
4 For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. [sent-14, score-0.486]
5 For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. [sent-15, score-0.624]
6 1 Introduction For many important problems in machine learning, we have a limited amount of labeled training data and a very high-dimensional feature space. [sent-16, score-0.282]
7 Since our focus is on classification, the parameters we consider are feature weights in a linear classifier. [sent-26, score-0.345]
8 The prior knowledge is encoded as a network or graph whose nodes represent features and whose edges represent similarities between the features in terms of how likely they are to have similar weights. [sent-27, score-0.842]
9 During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. [sent-28, score-0.236]
10 This regularization objective is closely connected to the unsupervised dimensionality reduction method, locally linear embedding (LLE), proposed by Roweis and Saul [7]. [sent-29, score-0.285]
11 The network is typically sparse in that each feature has only a small number of neighbors. [sent-34, score-0.363]
12 Consequently, we can implicitly construct rich and dense covariance matrices over large feature spaces without the space and computational blow-ups that would be incurred if we constructed these matrices explicitly. [sent-36, score-0.454]
13 We show that regularization with feature-networks derived from word co-occurrence statistics outperforms manifold regularization and another, more recent, semisupervised learning approach [5] on the task of text classification. [sent-38, score-0.669]
14 Feature network based regularization also supports extensions which provide flexibility in modeling parameter dependencies, allowing for feature dissimilarities and the introduction of feature classes whose weights have common but unknown means. [sent-39, score-1.245]
15 2 Regularized Learning with Networks of Features We assume a standard supervised learning framework in which we are given a training set of instances T = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and associated labels y_i ∈ Y. [sent-42, score-0.305]
16 The graph's vertices, {1, . . . , d}, correspond to the features of our model, and its edges link features whose weights are believed to be similar. [sent-49, score-0.649]
17 Conversely, a weight of zero means that two features are not believed a priori to be similar. [sent-51, score-0.339]
18 The weights of G are encoded by a matrix, P , where Pij ≥ 0 gives the weight of the directed edge from vertex i to vertex j. [sent-54, score-0.447]
19 We constrain the out-degree of each vertex to sum to one, \sum_j P_{ij} = 1, so that no feature “dominates” the graph. [sent-55, score-0.297]
20 Because the semantics of the graph are that linked features should have similar weights, we penalize each feature’s weight by the squared amount it differs from the weighted average of its neighbors. [sent-56, score-0.492]
21 This gives us the following criterion to optimize in learning: loss(w) = \sum_{i=1}^n l(x_i, y_i; w) + α \sum_{j=1}^d (w_j − \sum_k P_{jk} w_k)^2 + β ||w||_2^2,  (1)  where we have added a ridge term to make the loss strictly convex. [sent-57, score-0.423]
22 The hyperparameters α and β specify the amount of network and ridge regularization respectively. [sent-58, score-0.655]
23 The regularization penalty can be rewritten as w⊤ M w where M = α (I − P )⊤ (I − P ) + β I. [sent-59, score-0.383]
24 The matrix M is symmetric positive definite, and therefore our criterion possesses a Bayesian interpretation in which the weight vector, w, is a priori normally distributed with mean zero and covariance matrix (2M)^{-1}. [sent-60, score-0.366]
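To make the quadratic form concrete, the following is a minimal sketch, not taken from the authors' code, of assembling M = α(I − P)^⊤(I − P) + βI from a sparse, row-normalized P and evaluating the penalty w^⊤Mw; the function and variable names are our own.

```python
import numpy as np
import scipy.sparse as sp

def network_regularizer(P, alpha, beta):
    """Assemble M = alpha * (I - P)^T (I - P) + beta * I for a sparse,
    row-normalized feature-similarity matrix P of shape (d, d)."""
    d = P.shape[0]
    I = sp.identity(d, format="csr")
    A = I - P
    return alpha * (A.T @ A) + beta * I

def network_penalty(w, M):
    """Quadratic network-regularization penalty w^T M w."""
    return float(w @ (M @ w))

# Toy example: features 0/1 are neighbors, features 2/3 are neighbors.
P = sp.csr_matrix(np.array([[0., 1., 0., 0.],
                            [1., 0., 0., 0.],
                            [0., 0., 0., 1.],
                            [0., 0., 1., 0.]]))
M = network_regularizer(P, alpha=1.0, beta=0.1)
w = np.array([1.0, 1.0, -2.0, 2.0])
print(network_penalty(w, M))  # the mismatched pair (2, 3) dominates the cost
```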
25 Additionally, the induced covariance matrix M^{-1} will typically be dense even though P is sparse, showing that we can construct dense covariance structures over w without incurring storage and computation costs. [sent-67, score-0.374]
26 The solution to equation (2) is found by performing eigen-decomposition on the matrix (I − P )⊤ (I − P ) = U ΛU ⊤ where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. [sent-75, score-0.3]
27 Looking at equation (1) and ignoring the ridge term, it is clear that our feature network regularization penalty is identical to LLE except that the embedding is found for the feature weights rather than data instances. [sent-86, score-1.425]
28 If we let L(Y, Xw) denote the unregularized loss over the training set, where X is the n × d matrix of instances and Y is the n-vector of class labels, we can express equation (1) in matrix form as w^* = argmin_w L(Y, Xw) + w^⊤(α(I − P)^⊤(I − P) + βI)w.  (3) [sent-88, score-0.333]
29 Defining X̃ to be XU(αΛ + βI)^{-1/2}, where U and Λ are from the eigen-decomposition above, it is not hard to show that equation (3) is equivalent to the alternative ridge regularized learning problem w̃^* = argmin_{w̃} L(Y, X̃w̃) + w̃^⊤w̃. [sent-89, score-0.33]
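The change of basis that turns the network penalty into a plain ridge penalty can be sketched as follows; this is our reconstruction using a dense eigendecomposition, and the names (ridge_transform, lam) are illustrative rather than the paper's.

```python
import numpy as np

def ridge_transform(X, P, alpha, beta):
    """Compute X_tilde = X U (alpha * Lam + beta * I)^{-1/2}, where
    (I - P)^T (I - P) = U Lam U^T, so that feature-network regularization
    on X becomes ordinary L2 regularization on X_tilde.
    X: (n, d) data matrix; P: dense (d, d) row-normalized similarity matrix;
    beta > 0 assumed so the scaling is well defined."""
    d = P.shape[0]
    A = np.eye(d) - P
    lam, U = np.linalg.eigh(A.T @ A)        # eigenvalues Lam and eigenvectors U
    scale = 1.0 / np.sqrt(alpha * lam + beta)
    return (X @ U) * scale                   # scales each column j by scale[j]

# A weight vector w_tilde fit by a standard ridge solver on X_tilde maps back
# to the original space as w = U @ (scale * w_tilde), recovering equation (3).
```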
30 In using all of the eigenvectors, the dimensionality of the feature embedding is not reduced. [sent-92, score-0.296]
31 In the high dimensional problems of section 4, we find that regularization with feature networks outperforms LLE-based regression. [sent-96, score-0.558]
32 3.1 Regularizing with Classes of Features In machine learning, features can often be grouped into classes, such that all the weights of the features in a given class are drawn from the same underlying distribution. [sent-103, score-0.423]
33 Using an appropriately constructed feature graph, we can model the case in which the underlying distributions are believed to be Gaussians with known, identical variances but with unknown means. [sent-105, score-0.352]
34 That is, the case in which there are k disjoint classes of features {C_i}_{i=1}^k whose weights are drawn i.i.d. from class-specific Gaussians. [sent-106, score-0.378]
35 Additionally, the number of edges in this construction scales quadratically in the clique sizes, resulting in feature graphs that are not sparse. [sent-111, score-0.296]
36 Instead, we introduce k virtual features, f_1, . . . , f_k, that do not appear in any of the data instances but whose weights, µ_1, . . . , µ_k, serve as estimates of the class means. [sent-115, score-0.309]
37 In creating the feature graph, we link each feature to the virtual feature for its class with an edge of weight one. [sent-122, score-1.006]
38 Denoting the class of feature i as c(i), and setting the hyperparameters α and β in equation (1) to 1/(2σ^2) and 0, respectively, yields a network regularization cost of (1/2) σ^{-2} \sum_{i=1}^d (w_i − µ_{c(i)})^2. [sent-124, score-0.655]
39 Since the virtual features do not appear in any instances, i.e. [sent-125, score-0.283]
40 their values are zero in every data instance, their weights are free to take on whatever values minimize the network regularization cost in (1), in particular the estimates of the class means, µ̂_1, . . . , µ̂_k. [sent-127, score-0.507]
41 Consequently, minimizing the network regularization penalty maximizes the log-likelihood for the intended scenario. [sent-131, score-0.518]
42 We can extend this construction to model the case in which the feature weights are drawn from a mixture of Gaussians by connecting each feature to a number of virtual features with edge weights that sum to one. [sent-132, score-1.047]
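One way to realize the class construction described above, sketched under our own naming conventions (the extracts give no code), is to append an all-zero "virtual" column per class and give each feature a single unit-weight edge to its class's virtual feature:

```python
import numpy as np
import scipy.sparse as sp

def add_feature_classes(X, class_of, k):
    """X: (n, d) data matrix; class_of[j] in {0, ..., k-1} is feature j's class.
    Returns the data matrix augmented with k all-zero virtual-feature columns
    and the row-normalized graph matrix P linking each real feature to the
    virtual feature of its class with weight one."""
    n, d = X.shape
    X_aug = sp.hstack([sp.csr_matrix(X), sp.csr_matrix((n, k))], format="csr")
    rows = np.arange(d)
    cols = d + np.asarray(class_of)     # column index of each class's virtual feature
    P = sp.csr_matrix((np.ones(d), (rows, cols)), shape=(d + k, d + k))
    # Each real feature has one out-edge of weight one, so P is already row-
    # normalized.  With alpha = 1/(2 sigma^2) and beta = 0, summing the network
    # penalty over the d real features gives (1/2) sigma^{-2} sum_i (w_i - mu_{c(i)})^2,
    # as stated in section 3.1.
    return X_aug, P
```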
43 3.2 Incorporating Feature Dissimilarities Feature network regularization can also be extended to induce features to have opposing weights. [sent-134, score-0.505]
44 Such feature “dissimilarities” can be useful in tasks such as sentiment prediction where we would like weights for words such as “great” or “fantastic” to have opposite signs from their negated bigram counterparts “not great” and “not fantastic,” and from their antonyms. [sent-135, score-0.776]
45 To model dissimilarities, we construct a separate graph whose edges represent anti-correlations between features. [sent-136, score-0.306]
46 Regularizing over this graph encourages each feature’s weight to be close to the negative of the average of its neighbors’ weights. [sent-137, score-0.322]
47 To do this, we encode the dissimilarity graph using a matrix Q, defined analogously to the matrix P, and add the term \sum_i (w_i + \sum_j Q_{ij} w_j)^2 to the network regularization criterion, which can be written as w^⊤(I + Q)^⊤(I + Q)w. [sent-138, score-0.824]
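A small sketch of the combined criterion with both graphs; how strongly the dissimilarity term is weighted relative to the similarity term is our own assumption (the extracts only give the matrix form w^⊤(I + Q)^⊤(I + Q)w), and the names are illustrative.

```python
import numpy as np

def combined_penalty(w, P, Q, alpha, gamma, beta):
    """Similarity penalty    sum_j (w_j - sum_k P_jk w_k)^2 = ||(I - P) w||^2,
    dissimilarity penalty   sum_i (w_i + sum_j Q_ij w_j)^2 = ||(I + Q) w||^2,
    plus a ridge term; alpha, gamma, beta are the respective weights."""
    sim = np.sum((w - P @ w) ** 2)
    dis = np.sum((w + Q @ w) ** 2)
    return alpha * sim + gamma * dis + beta * np.dot(w, w)
```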
48 The authors of [12] use a similar construction with the graph Laplacian in order to incorporate dissimilarities between instances in manifold learning. [sent-141, score-0.581]
49 3.3 Regularizing Features with the Graph Laplacian A natural alternative to the network regularization criterion given in section 2 is to regularize the feature weights using a penalty derived from the graph Laplacian [13]. [sent-143, score-1.108]
50 Here, the feature graph’s edge weights are given by a symmetric matrix, W, whose entries, W_{ij} ≥ 0, give the weight of the edge between features i and j. [sent-144, score-0.809]
51 The Laplacian penalty is (1/2) \sum_{i,j} W_{ij} (w_i − w_j)^2, which can be written as w^⊤(D − W)w, where D = diag(W1) is the vertex degree matrix. [sent-145, score-0.305]
52 The main difference between the Laplacian penalty and the network penalty in equation (1) is that the Laplacian penalizes each edge equally (modulo the edge weights) whereas the network penalty penalizes each feature equally. [sent-146, score-1.144]
53 In graphs where there are large differences in vertex degree, the Laplacian penalty will therefore focus most of the regularization cost on features with many neighbors. [sent-147, score-0.643]
54 Experiments in section 4 show that the criterion in (1) outperforms the Laplacian penalty as well as a related penalty derived from the normalized graph Laplacian, (1/2) \sum_{i,j} W_{ij} (w_i/\sqrt{D_{ii}} − w_j/\sqrt{D_{jj}})^2. [sent-148, score-0.632]
55 The normalized Laplacian penalty assumes that \sqrt{D_{jj}} w_i ≈ \sqrt{D_{ii}} w_j, which is different from assuming that linked features should have similar weights. [sent-149, score-0.5]
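For comparison, the three penalties discussed here can be computed side by side; this is an illustrative sketch under our own names, assuming a dense symmetric W in which every feature has at least one neighbor.

```python
import numpy as np

def compare_penalties(w, W):
    """Network penalty (one term per feature), unnormalized Laplacian penalty
    (one term per edge), and normalized Laplacian penalty, for symmetric W."""
    deg = W.sum(axis=1)
    P = W / deg[:, None]                      # row-normalized graph for the network penalty
    network = np.sum((w - P @ w) ** 2)        # sum_j (w_j - sum_k P_jk w_k)^2
    D = np.diag(deg)
    laplacian = w @ ((D - W) @ w)             # = 0.5 * sum_ij W_ij (w_i - w_j)^2
    d_is = 1.0 / np.sqrt(deg)
    L_norm = np.eye(len(w)) - (W * d_is[:, None]) * d_is[None, :]
    normalized = w @ (L_norm @ w)             # = 0.5 * sum_ij W_ij (w_i/sqrt(D_ii) - w_j/sqrt(D_jj))^2
    return network, laplacian, normalized
```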
56 Figure 1: Left: Accuracy of feature network regularization (FNR) and five baselines on “20 newsgroups” data (x-axis: Number of Training Instances, 100–2000; baselines include Laplacian and Ridge penalties). [sent-151, score-0.62]
57 4 Experiments We evaluated logistic regression augmented with feature network regularization on two natural language processing tasks. [sent-153, score-0.618]
58 The second was sentiment classification of product reviews, the task of classifying user-written reviews according to whether they are favorable or unfavorable to the product under review based on the review text [11]. [sent-155, score-0.483]
59 For document classification, the feature graph was constructed using feature co-occurrence statistics gleaned from unlabeled data. [sent-157, score-0.807]
60 In sentiment prediction, both co-occurrence statistics and prior domain knowledge were used. [sent-158, score-0.306]
61 4.1 Experiments on 20 Newsgroups We evaluated feature network based regularization on the 20 newsgroups classification task using all twenty classes. [sent-160, score-0.682]
62 The feature set was restricted to the 11,376 words which occurred in at least 20 documents, not counting stop-words. [sent-161, score-0.348]
63 To construct the feature graph, each feature (word) was represented by a binary vector denoting its presence/absence in each of the 20,000 documents of the dataset. [sent-163, score-0.54]
64 Each feature was linked to the 25 other features with highest cosine scores, provided that the scores were above a minimum threshold of 0. [sent-165, score-0.49]
65 The edge weights of the graph were set to these cosine scores and the matrix P was constructed by normalizing each vertex’s out-degree to sum to one. [sent-167, score-0.515]
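Our reconstruction of the graph-construction recipe in the extracts (cosine similarity over binary word/document presence vectors, top-25 neighbors above a threshold, edges weighted by the cosine score, and row normalization); the threshold value is truncated in the extracts, so the 0.1 below is only a placeholder.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

def build_word_graph(presence, k=25, threshold=0.1):
    """presence: (d, n) binary matrix of word presence in each document.
    Links each word to its k most cosine-similar words whose score exceeds
    the threshold, weights edges by the cosine score, and row-normalizes
    the result to obtain P."""
    S = cosine_similarity(presence)          # (d, d) cosine scores between words
    np.fill_diagonal(S, 0.0)                 # no self-edges
    d = S.shape[0]
    rows, cols, vals = [], [], []
    for i in range(d):
        for j in np.argsort(S[i])[::-1][:k]:
            if S[i, j] > threshold:
                rows.append(i); cols.append(j); vals.append(S[i, j])
    P = sp.csr_matrix((vals, (rows, cols)), shape=(d, d))
    out_degree = np.asarray(P.sum(axis=1)).ravel()
    out_degree[out_degree == 0.0] = 1.0      # leave words with no neighbors untouched
    return sp.diags(1.0 / out_degree) @ P
```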
66 For ridge, the amount of L2 regularization was chosen using cross validation on the training set. [sent-171, score-0.271]
67 Similarly, for feature network regularization and the Laplacian regularizers, the hyperparameters α and β were chosen through cross validation on the training set using a simple grid search. [sent-172, score-0.671]
68 The results in figure 1 show that feature network regularization with a graph constructed from unlabeled data outperforms all baselines and increases accuracy by 4%-17% over the plain ridge penalty, an error reduction of 17%-30%. [sent-178, score-1.25]
69 Additionally, by scaling the eigenvectors by their eigenvalues, feature network regularization keeps more information about the directions of least cost in weight space than does LLE regression, which does not rescale the eigenvectors but simply keeps or discards them (i. [sent-181, score-0.94]
70 However, the graph was constructed so that each document had only K = 10 neighbors (the authors in [10] use a fully connected graph which does not fit in memory for the entire 20 newsgroups dataset). [sent-188, score-0.586]
71 Here we see that feature network regularization is competitive with the other semi-supervised methods and performs best at all but the smallest training set size. [sent-190, score-0.634]
72 4.2 Sentiment Classification For sentiment prediction, we obtained the product review datasets used in [11]. [sent-192, score-0.306]
73 In both, we used a list of sentimentally-charged words obtained from SentiWordNet [14], a database that associates positive and negative sentiment scores with each word in WordNet. [sent-198, score-0.531]
74 In the first experiment, we constructed a set of feature classes in the manner described in section 3.1. [sent-199, score-0.347]
75 From SentiWordNet we extracted a list of roughly 200 words with high positive and negative sentiment scores that also occurred in the product reviews at least 100 times. [sent-202, score-0.612]
76 As described in section 3.1, all words in the positive cluster were attached to a virtual feature representing the mean weight of the positive cluster words, and all words in the negative cluster were attached to a virtual feature representing the mean weight of the negative cluster words. [sent-205, score-1.474]
77 As shown in figure 2, imposing feature clusters on the two classes of words improves performance noticeably while the addition of the feature dissimilarity edge does not yield much benefit. [sent-209, score-0.742]
78 This simple set of experiments demonstrated the applicability of feature classes for inducing groups of features to have similar means, and that the words extracted from SentiWordNet were relatively helpful in determining the sentiment of a review. [sent-211, score-0.833]
79 Thus we extended the feature sets to include all unigram and bigram word-features which occurred in ten or more reviews. [sent-213, score-0.307]
80 The total number of reviews and the size of the feature sets are given in table 1. [sent-214, score-0.322]
81 Thus, to align features along dimensions of ‘sentiment,’ we computed the correlations of all features with the SentiWordNet features so that each word was represented as a 200 dimensional vector of correlations with these highly charged sentiment words. [sent-219, score-0.904]
82 Two features were linked if each was among the other’s 100 nearest neighbors. [sent-222, score-0.262]
83 For simplicity, the edge weights were set to one and the graph weight matrix was then row-normalized in order to construct the matrix P . [sent-223, score-0.634]
84 The number of edges in each feature graph is given in table 1. [sent-224, score-0.453]
85 The ‘kitchen’ dataset was used as a development dataset in order to arrive at the method for constructing the feature graph and for choosing the hyperparameter values: α = 9. [sent-225, score-0.423]
86 Figure 3 gives accuracy results for all four sentiment datasets at training set sizes of 50 to 1000 instances. [sent-228, score-0.396]
87 The results show that linking features which are similarly correlated with sentiment-loaded words yields improvements on every dataset and at every training set size. [sent-229, score-0.29]
88 A closely related method is the fused lasso (Tibshirani et al. [15]), which can be interpreted as using the graph Laplacian regularizer but with an L1 norm instead of L2 on the residuals of weight differences, \sum_i \sum_{j∼i} |w_i − w_j|, with all edge weights set to one. [sent-231, score-0.536]
89 As the authors discuss, an L1 penalty prefers that weights of linked features be exactly equal so that the residual vector of weight differences is sparse. [sent-232, score-0.618]
90 L1 is appropriate if the true weights are believed to be exactly equal, but in many settings, features are near copies of one another whose weights should be similar rather than identical. [sent-233, score-0.5]
91 Optimizing L1 feature weight differences also leads to a much harder optimization problem, making it less applicable in large scale learning. [sent-235, score-0.346]
92 Li and Li [13] regularize feature weights using the normalized graph Laplacian in their work on biomedical prediction tasks. [sent-236, score-0.555]
93 The authors represent each feature i as a vector of meta-features, u_i, and compute the entries of the feature weight covariance matrix as C_{ij} = exp(−||u_i − u_j||^2 / (2σ^2)). [sent-240, score-0.655]
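For contrast with the sparse feature graph, the meta-feature covariance described here is dense; a minimal sketch of the formula under our own naming:

```python
import numpy as np

def metafeature_covariance(U_meta, sigma):
    """C_ij = exp(-||u_i - u_j||^2 / (2 sigma^2)) for meta-feature vectors u_i
    stacked as the rows of U_meta; yields a dense (d, d) covariance matrix."""
    diffs = U_meta[:, None, :] - U_meta[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```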
94 Obviously, the choice of which is more appropriate, a feature graph or metric space, is application dependent. [sent-241, score-0.385]
95 However, it is less obvious how to incorporate feature dissimilarities in a metric space. [sent-242, score-0.401]
96 Interestingly, the entries of this covariance matrix are learned jointly with a regression model for predicting feature weight covariances as a function of meta-features. [sent-248, score-0.531]
97 6 Conclusion We have presented regularized learning with networks of features, a simple and flexible framework for incorporating expectations about feature weight similarities in learning. [sent-250, score-0.572]
98 Feature similarities are modeled using a feature graph, and the weight of each feature is preferred to be close to the average of its neighbors’ weights. [sent-251, score-0.817]
99 On the task of document classification, feature network regularization is superior to several related criteria, as well as to a manifold learning approach where the graph models similarities between instances rather than between features. [sent-252, score-1.135]
100 Extensions for modeling feature classes, as well as feature dissimilarities, yielded benefits on the problem of sentiment prediction. [sent-253, score-0.762]
wordName wordTfidf (topN-words)
[('sentiment', 0.306), ('ridge', 0.266), ('feature', 0.228), ('regularization', 0.217), ('fnr', 0.192), ('dissimilarities', 0.173), ('sentiwordnet', 0.168), ('penalty', 0.166), ('lle', 0.162), ('laplacian', 0.162), ('graph', 0.157), ('features', 0.153), ('instances', 0.147), ('sim', 0.144), ('network', 0.135), ('virtual', 0.13), ('aso', 0.12), ('weight', 0.118), ('weights', 0.117), ('manifold', 0.104), ('newsgroups', 0.102), ('eigenvectors', 0.102), ('dissim', 0.096), ('dvds', 0.096), ('reviews', 0.094), ('similarities', 0.086), ('words', 0.083), ('kitchen', 0.081), ('covariance', 0.081), ('unlabeled', 0.077), ('pij', 0.077), ('edge', 0.074), ('pcr', 0.072), ('books', 0.072), ('wj', 0.07), ('vertex', 0.069), ('edges', 0.068), ('believed', 0.068), ('embedding', 0.068), ('matrix', 0.066), ('dissimilarity', 0.066), ('linked', 0.064), ('regularized', 0.064), ('classes', 0.063), ('electronics', 0.062), ('classi', 0.061), ('document', 0.061), ('constructed', 0.056), ('training', 0.054), ('blitzer', 0.054), ('neighbors', 0.053), ('regularize', 0.053), ('yi', 0.052), ('wordnet', 0.051), ('xw', 0.051), ('word', 0.05), ('appliances', 0.048), ('charged', 0.048), ('dip', 0.048), ('djj', 0.048), ('fantastic', 0.048), ('reimplementation', 0.048), ('documents', 0.048), ('wi', 0.047), ('negative', 0.047), ('scores', 0.045), ('whose', 0.045), ('tibshirani', 0.045), ('nearest', 0.045), ('eigenvalues', 0.044), ('regularizing', 0.043), ('text', 0.043), ('bigram', 0.042), ('krupka', 0.042), ('kegg', 0.042), ('synonyms', 0.042), ('incorporating', 0.042), ('dimensional', 0.041), ('favorable', 0.04), ('baselines', 0.04), ('extensions', 0.039), ('fused', 0.038), ('constructing', 0.038), ('cost', 0.038), ('regression', 0.038), ('outperforms', 0.038), ('dense', 0.037), ('hyperparameters', 0.037), ('occurred', 0.037), ('cluster', 0.036), ('construct', 0.036), ('goldberg', 0.036), ('incurring', 0.036), ('stars', 0.036), ('accuracy', 0.036), ('criterion', 0.035), ('wij', 0.034), ('networks', 0.034), ('nucleic', 0.034), ('additionally', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
2 0.16681021 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
3 0.16532634 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
4 0.16005722 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
5 0.14399518 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
Author: Tong Zhang
Abstract: We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data. 1
6 0.14274596 193 nips-2008-Regularized Co-Clustering with Dual Supervision
8 0.12666923 84 nips-2008-Fast Prediction on a Tree
9 0.12224044 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
10 0.1179284 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.11137439 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
12 0.10955317 214 nips-2008-Sparse Online Learning via Truncated Gradient
13 0.10528078 4 nips-2008-A Scalable Hierarchical Distributed Language Model
14 0.10463566 171 nips-2008-Online Prediction on Large Diameter Graphs
15 0.09604536 78 nips-2008-Exact Convex Confidence-Weighted Learning
16 0.09486571 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
17 0.093482576 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
18 0.092314124 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
19 0.092308208 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
20 0.085407197 107 nips-2008-Influence of graph construction on graph-based clustering measures
topicId topicWeight
[(0, -0.289), (1, -0.146), (2, -0.057), (3, 0.02), (4, 0.047), (5, -0.058), (6, 0.06), (7, -0.009), (8, -0.101), (9, -0.107), (10, 0.002), (11, 0.007), (12, -0.286), (13, -0.11), (14, -0.019), (15, -0.092), (16, -0.092), (17, 0.068), (18, -0.021), (19, -0.017), (20, -0.07), (21, -0.072), (22, -0.019), (23, -0.06), (24, -0.012), (25, 0.068), (26, 0.003), (27, -0.051), (28, -0.099), (29, -0.025), (30, 0.1), (31, 0.028), (32, 0.011), (33, 0.116), (34, 0.02), (35, -0.041), (36, 0.007), (37, -0.073), (38, -0.088), (39, -0.1), (40, 0.002), (41, 0.067), (42, -0.004), (43, -0.11), (44, -0.051), (45, 0.023), (46, -0.027), (47, 0.102), (48, -0.031), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.97237968 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
2 0.68925655 193 nips-2008-Regularized Co-Clustering with Dual Supervision
Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic
Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1
3 0.6713295 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
4 0.63792747 225 nips-2008-Supervised Bipartite Graph Inference
Author: Yoshihiro Yamanishi
Abstract: We formulate the problem of bipartite graph inference as a supervised learning problem, and propose a new method to solve it from the viewpoint of distance metric learning. The method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data. 1
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
6 0.59336555 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
7 0.58524972 84 nips-2008-Fast Prediction on a Tree
8 0.55403209 62 nips-2008-Differentiable Sparse Coding
9 0.53257626 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
10 0.51217991 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
11 0.50557113 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
12 0.49902043 107 nips-2008-Influence of graph construction on graph-based clustering measures
13 0.49501395 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
14 0.49372265 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
15 0.49048442 171 nips-2008-Online Prediction on Large Diameter Graphs
16 0.48927048 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
17 0.48883143 126 nips-2008-Localized Sliced Inverse Regression
18 0.47840646 69 nips-2008-Efficient Exact Inference in Planar Ising Models
19 0.46693757 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
20 0.46349871 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
topicId topicWeight
[(3, 0.15), (6, 0.092), (7, 0.111), (12, 0.048), (15, 0.015), (28, 0.154), (57, 0.072), (59, 0.021), (63, 0.039), (71, 0.019), (77, 0.052), (78, 0.012), (80, 0.048), (83, 0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.89481795 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
2 0.88827503 228 nips-2008-Support Vector Machines with a Reject Option
Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu
Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1
3 0.82424217 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
4 0.81713831 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
5 0.81539822 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
7 0.8072108 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
8 0.80682385 75 nips-2008-Estimating vector fields using sparse basis field expansions
9 0.80408508 202 nips-2008-Robust Regression and Lasso
10 0.80398738 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
11 0.80384278 143 nips-2008-Multi-label Multiple Kernel Learning
12 0.80135679 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
13 0.79769224 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
14 0.79695749 99 nips-2008-High-dimensional support union recovery in multivariate regression
15 0.7968325 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
16 0.79430586 226 nips-2008-Supervised Dictionary Learning
17 0.79319775 196 nips-2008-Relative Margin Machines
18 0.793042 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
19 0.79274511 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
20 0.79173976 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data