nips nips2005 nips2005-195 knowledge-graph by maker-knowledge-mining

195 nips-2005-Transfer learning for text classification


Source: pdf

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form function of training set statistics; we call this mapping from statistics to parameters the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Transfer learning for text classification. Chuong B. Do, Andrew Y. Ng. [sent-1, score-0.204]

2 Computer Science Department, Stanford University, Stanford, CA 94305. Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. [sent-3, score-0.431]

3 In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form function of training set statistics; we call this mapping from statistics to parameters the parameter function. [sent-4, score-0.438]

4 Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. [sent-5, score-0.231]

5 The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. [sent-7, score-0.262]

6 We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. [sent-8, score-0.362]

7 1 Introduction. In the multiclass text classification task, we are given a training set of documents, each labeled as belonging to one of K disjoint classes, and a new unlabeled test document. [sent-9, score-0.409]

8 Using the training set as a guide, we must predict the most likely class for the test document. [sent-10, score-0.173]

9 “Bag-of-words” linear text classifiers represent a document as a vector x of word counts, and predict the class whose score (a linear function of x) is highest, i.e., $y = \arg\max_k f_k(x)$. [sent-11, score-0.424]

10 Choosing parameters $\{\theta_{ki}\}$ that give high classification accuracy on test data is thus the main challenge for linear text classification algorithms. [sent-17, score-0.256]

11 In this paper, we focus on linear text classification algorithms in which the parameters are pre-specified functions of training set statistics; that is, each $\theta_{ki}$ is a function $\theta_{ki} := g(u_{ki})$ of some fixed statistics $u_{ki}$ of the training set. [sent-18, score-1.046]

12 Unlike discriminative learning methods, such as logistic regression [1] or support vector machines (SVMs) [2], which use numerical optimization to pick parameters, the learners we consider perform no optimization. [sent-19, score-0.234]

13 Rather, in our technique, parameter learning involves tabulating statistics vectors $\{u_{ki}\}$ and applying the closed-form function g to obtain parameters. [sent-20, score-0.396]

14 We refer to g, this mapping from statistics to parameters, as the parameter function. [sent-21, score-0.14]
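To make the tabulate-then-apply recipe above concrete, here is a minimal Python sketch; the statistics array layout and the toy g are illustrative assumptions, not choices from the paper:

```python
import numpy as np

def predict(x, U, g):
    """Score each class as f_k(x) = sum_i g(u_ki) * x_i and return the argmax.

    x : (n,) word-count vector for the test document
    U : (K, n, q) array, where U[k, i] is the statistics vector u_ki
    g : parameter function mapping a q-dimensional statistics vector to a scalar
    """
    K, n, _ = U.shape
    theta = np.array([[g(U[k, i]) for i in range(n)] for k in range(K)])
    return int(np.argmax(theta @ x))

# Toy usage: q = 2 statistics per (class, word) pair and a hand-picked g
# (the form of g here is purely illustrative).
rng = np.random.default_rng(0)
g = lambda u: u[0] - 0.5 * u[1]
U = rng.normal(size=(3, 5, 2))            # 3 classes, 5-word vocabulary
x = np.array([2.0, 0.0, 1.0, 0.0, 3.0])
print(predict(x, U, g))
```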

15 Many common text classification methods—including the multinomial and multivariate Bernoulli event models for naive Bayes [3], the vector space-based TFIDF classifier [4], and its probabilistic variant, PrTFIDF [5]—belong to this class of algorithms. [sent-22, score-0.454]

16 Here, picking a good text classifier from this class is equivalent to finding the right parameter function for the available statistics. [sent-23, score-0.315]

17 In practice, researchers often develop text classification algorithms by trial-and-error, guided by empirical testing on real-world classification tasks (cf. [sent-24, score-0.211]

18 In this paper, we consider the task of automatically learning a parameter function g for text classification. [sent-30, score-0.292]

19 Given a set of example text classification problems, we wish to “metalearn” a new learning algorithm (as specified by the parameter function g), which may then be applied to new classification problems. [sent-31, score-0.262]

20 On classification tasks with little training data, the learning algorithm given by our automatically learned parameter function consistently outperforms human-designed parameter functions based on naive Bayes and TFIDF, as well as existing discriminative learning approaches. [sent-34, score-0.602]

21 A labeled document is a pair (x, y) ∈ X × Y, where x is an n-dimensional vector with $x_i$ indicating the number of occurrences of word $w_i$ in the document, and y is the document’s class label. [sent-42, score-0.392]

22 A classification problem is a tuple $\langle D, S, (x_{\text{test}}, y_{\text{test}}) \rangle$, where D is a distribution over X × Y, $S = \{(x_i, y_i)\}_{i=1}^M$ is a set of M training examples, $(x_{\text{test}}, y_{\text{test}})$ is a single test example, and all M + 1 examples are drawn iid from D. [sent-43, score-0.319]

23 Given a training set S and a test input vector xtest , we must predict the value of the test class label ytest . [sent-44, score-0.52]

24 In linear classification algorithms, we evaluate the score $f_k(x_{\text{test}}) := \sum_i \theta_{ki}\, x_{\text{test},i}$ for assigning $x_{\text{test}}$ to each class $k \in \{1, \ldots, K\}$ [sent-45, score-0.583]

25 and pick the class $y = \arg\max_k f_k(x_{\text{test}})$ with the highest score. [sent-48, score-0.162]

26 In our meta-learning setting, we define each $\theta_{ki}$ as the component-wise evaluation of the parameter function g on some vector of training set statistics $u_{ki}$: $\begin{bmatrix} \theta_{k1} \\ \theta_{k2} \\ \vdots \\ \theta_{kn} \end{bmatrix} = \begin{bmatrix} g(u_{k1}) \\ g(u_{k2}) \\ \vdots \\ g(u_{kn}) \end{bmatrix}$, [sent-49, score-0.439]

27 where each $u_{ki}$ ($k = 1, \ldots, K$; $i = 1, \ldots, n$) is a vector whose components are computed from the training set S (we will provide specific examples later). [sent-62, score-0.149]

28 Furthermore, $g : \mathbb{R}^q \to \mathbb{R}$ is the parameter function mapping $u_{ki}$ to its corresponding parameter $\theta_{ki}$. [sent-63, score-0.775]

29 To illustrate these definitions, we show that two specific cases of the naive Bayes and TFIDF classification methods belong to the class of algorithms described above. [sent-64, score-0.224]

30 Naive Bayes: In the multinomial variant of the naive Bayes classification algorithm, the score for assigning a document x to class k is $f_k^{\mathrm{NB}}(x) := \log p(y = k) + \sum_{i=1}^n x_i \log p(w_i \mid y = k)$. [sent-65, score-0.596]

31 This view has proved useful for analysis of naive Bayes even when none of its probabilistic assumptions hold [9]; here, we adopt this view, without attaching any particular probabilistic meaning to the empirical frequencies p(·) that happen to be computed by the algorithm. [sent-67, score-0.212]

32 For balanced training sets, the first term is irrelevant. [sent-69, score-0.104]

33 As before, we write $f_k^{\mathrm{TFIDF}}(x) = \sum_i \theta_{ki} x_i$ with $\theta_{ki} = g_{\mathrm{TFIDF}}(u_{ki})$. [sent-73, score-0.144]

34 The statistics vector is again defined as in (3), but this time $g_{\mathrm{TFIDF}}(u_{ki}) := u_{ki1}\, u_{ki4} \log \frac{u_{ki5}}{u_{ki2}}$. [sent-74, score-0.128]

35 These include most other variants of TFIDF based on different TF and IDF terms [7], PrTFIDF [5], and various heuristically modified versions of naive Bayes [6]. [sent-76, score-0.217]
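As a rough illustration of how both families fit the same template, the following sketch writes a naive-Bayes-style and a TFIDF-style parameter function over a hypothetical statistics layout; the component indices below are assumptions, since the paper's actual vector in (3) does not appear in this summary:

```python
import numpy as np

# Hypothetical layout of the statistics vector u_ki (assumed for illustration):
#   u[0] = count of word w_i in documents of class k
#   u[1] = total word count in documents of class k
#   u[2] = number of documents containing w_i
#   u[3] = total number of documents

def g_nb(u, alpha=1.0, vocab_size=10000):
    # Naive-Bayes-style: log of a smoothed class-conditional word frequency.
    return np.log((u[0] + alpha) / (u[1] + alpha * vocab_size))

def g_tfidf(u):
    # TFIDF-style: a term-frequency factor times a log inverse-document-frequency factor.
    return (u[0] / max(u[1], 1.0)) * np.log(u[3] / max(u[2], 1.0))

u = np.array([4.0, 100.0, 3.0, 20.0])
print(g_nb(u), g_tfidf(u))
```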

36 3 Learning the parameter function. In the last section, we gave two examples of algorithms that obtain their parameters $\theta_{ki}$ by applying a function g to a statistics vector $u_{ki}$. [sent-77, score-1.046]

37 In each case, the parameter function was hand-designed, either from probabilistic (in the case of naive Bayes [3]) or geometric (in the case of TFIDF [4]) considerations. [sent-78, score-0.246]

38 We now consider the problem of automatically learning a parameter function from example classification tasks. [sent-79, score-0.119]

39 In the sequel, we assume fixed statistics vectors $\{u_{ki}\}$ and focus on finding an optimal parameter function g. [sent-80, score-0.153]

40 In the standard supervised learning setting, we are given a training set of examples sampled from some unknown distribution D, and our goal is to use the training set to make a prediction on a new test example also sampled from D. [sent-81, score-0.326]

41 By using the training examples to understand the statistical regularities in D, we hope to predict ytest from xtest with low error. [sent-82, score-0.422]

42 Analogously, the problem of meta-learning g is again a supervised learning task; here, however, the training “examples” are now classification problems sampled from a distribution D over classification problems. [sent-83, score-0.169]

43 Note that (5) implicitly defines $f_k^{\mathrm{TFIDF}}(x)$ as a dot product of two vectors, each of whose components consists of a product of two terms. [sent-84, score-0.317]

44 In the normalized TFIDF classifier, both vectors are normalized to unit length before computing the dot product, a modification that makes the algorithm more stable for documents of varying length. [sent-85, score-0.158]

45 Note that in our meta-learning problem, the output of our algorithm is a parameter function g mapping statistics to parameters. [sent-87, score-0.14]

46 Our training data, however, do not explicitly indicate the best parameter function g ∗ for each example classification problem. [sent-88, score-0.14]

47 Effectively then, in the meta-learning task, the central problem is to fit g to some unseen g∗ , based on test examples in each training classification problem. [sent-89, score-0.147]

48 By seeing many instances of text classification problems drawn from D, we hope to learn a parameter function g that exploits the statistical regularities in problems from D. [sent-90, score-0.104]

49 For a new test classification problem $\langle D_{\text{test}}, S_{\text{test}}, (x_{\text{test}}, y_{\text{test}}) \rangle$ sampled independently from D, we desire that our learned g correctly classifies $x_{\text{test}}$ with high probability. [sent-92, score-0.41]

50 Using the linearity assumption, we pose a convex optimization problem for finding a parameter function g that achieves small loss on test examples in the training collection. [sent-94, score-0.229]

51 3.1 Softmax learning. Recall that in softmax regression, the class probabilities $p(y \mid x)$ are modeled as $p(y = k \mid x; \{\theta_{ki}\}) := \frac{\exp(\sum_i \theta_{ki} x_i)}{\sum_{k'} \exp(\sum_i \theta_{k'i} x_i)}$, $k = 1, \ldots, K$, (7) [sent-97, score-0.381]

52 where the parameters $\{\theta_{ki}\}$ are learned from the training data S by maximizing the conditional log likelihood of the data. [sent-100, score-0.21]

53 To pose our optimization problem, we start by learning the linear form $g(u_{ki}) = \beta^T u_{ki}$. [sent-104, score-0.686]

54 Under this parameterization, the conditional likelihood of an example (x, y) is $p(y = k \mid x; \beta) = \frac{\exp(\beta^T \sum_i u_{ki} x_i)}{\sum_{k'} \exp(\beta^T \sum_i u_{k'i} x_i)}$, $k = 1, \ldots, K$. (8) [sent-105, score-0.8]

55 In this setup, one natural approach for learning a linear function g is to maximize the (regularized) conditional log likelihood $\ell(\beta : S')$ for the entire collection $S'$: $\ell(\beta : S') := \sum_{j=1}^m \log p(y^{(j)} \mid x^{(j)}; \beta) - C\|\beta\|^2 = \sum_{j=1}^m \log \frac{\exp\big(\beta^T \sum_i u^{(j)}_{y^{(j)}i} x^{(j)}_i\big)}{\sum_k \exp\big(\beta^T \sum_i u^{(j)}_{ki} x^{(j)}_i\big)} - C\|\beta\|^2$. (9) [sent-109, score-0.253]

56 In (9), the latter term corresponds to a Gaussian prior on the parameters β, which provides a means for controlling the complexity of the learned parameter function g. [sent-110, score-0.953]

57 The maximization of (9) is similar to softmax regression training, except that here, instead of optimizing over the parameters $\{\theta_{ki}\}$ directly, we optimize over the choice of β. [sent-111, score-0.288]
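A minimal sketch of maximizing (9) over β by plain gradient ascent, assuming the aggregated statistics $z_{jk} = \sum_i u^{(j)}_{ki} x^{(j)}_i$ have been precomputed; the shapes, step size, and iteration count are illustrative:

```python
import numpy as np

def fit_beta(Z, y, C=0.1, lr=0.1, iters=500):
    """Maximize the regularized conditional log likelihood in (9).

    Z : (m, K, q) array; Z[j, k] = sum_i u_ki^{(j)} x_i^{(j)}
    y : (m,) correct class index for each training example
    """
    m, K, q = Z.shape
    beta = np.zeros(q)
    for _ in range(iters):
        s = Z @ beta                                  # (m, K) class scores
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
        # Gradient of the log likelihood, plus the Gaussian-prior term.
        grad = (Z[np.arange(m), y] - (p[:, :, None] * Z).sum(axis=1)).sum(axis=0)
        beta += lr * (grad - 2 * C * beta)
    return beta

Z = np.random.randn(50, 4, 2)                         # toy: m=50, K=4, q=2
y = np.random.randint(0, 4, size=50)
print(fit_beta(Z, y))
```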

58 By the Representer Theorem [10], there exists a maximizing solution to (9) for which the optimal parameter vector $\beta^*$ is a linear combination of training set statistics: $\beta^* = \sum_{j=1}^m \sum_k \alpha^*_{jk} \sum_i u^{(j)}_{ki} x^{(j)}_i$. (10) [sent-114, score-0.877]

59 From this, we reparameterize the original optimization over β in (9) as an equivalent optimization over training example weights $\{\alpha_{jk}\}$. [sent-115, score-0.13]

60 For notational convenience, let $K(j, j', k, k') := \sum_i \sum_{i'} x^{(j)}_i x^{(j')}_{i'} \big(u^{(j)}_{ki}\big)^T u^{(j')}_{k'i'}$. (11) [sent-116, score-0.169]

61 [Figure 1: Distribution of unnormalized $u_{ki}$ vectors in dmoz data (a) with and (b) without applying the log transformation in (15).] [sent-125, score-0.88]

62 Substituting (10) and (11) into (9), we obtain $\ell(\{\alpha_{jk}\} : S') := \sum_{j'=1}^m \log \frac{\exp\big(\sum_{j=1}^m \sum_k \alpha_{jk} K(j, j', k, y^{(j')})\big)}{\sum_{k'} \exp\big(\sum_{j=1}^m \sum_k \alpha_{jk} K(j, j', k, k')\big)} - C \sum_{j=1}^m \sum_{j'=1}^m \sum_k \sum_{k'} \alpha_{jk} \alpha_{j'k'} K(j, j', k, k')$. (12) [sent-129, score-0.123]
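Note that (11) factorizes as $K(j, j', k, k') = z_{jk}^T z_{j'k'}$ with $z_{jk} = \sum_i x^{(j)}_i u^{(j)}_{ki}$, so the kernel entries can be computed with two tensor contractions; a sketch under assumed array shapes:

```python
import numpy as np

def meta_kernel(X, U):
    """Compute K(j, j', k, k') from (11), using the factorization
    K = z_{jk}^T z_{j'k'} with z_{jk} = sum_i x_i^{(j)} u_ki^{(j)}.

    X : (m, n) word counts x^{(j)}
    U : (m, K, n, q) statistics vectors u_ki^{(j)}
    Returns an array of shape (m, K, m, K).
    """
    z = np.einsum('jn,jknq->jkq', X, U)      # (m, K, q) aggregated statistics
    return np.einsum('jkq,lpq->jklp', z, z)  # inner products of the z vectors

X = np.random.rand(6, 10)
U = np.random.randn(6, 3, 10, 2)
print(meta_kernel(X, U).shape)  # (6, 3, 6, 3)
```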

63 The assumption that g is a linear function of uki , however, places a severe restriction on the class of learnable parameter functions. [sent-131, score-0.749]

64 In particular, choosing a Gaussian (RBF) kernel, $K(u, v) := \exp(-\gamma\|u - v\|^2)$, gives a non-parametric representation for g: $g(u_{ki}) = \beta^T \Phi(u_{ki}) = \sum_{j=1}^m \sum_{k'} \sum_{i'} \alpha_{jk'} x^{(j)}_{i'} \exp\big(-\gamma\|u_{ki} - u^{(j)}_{k'i'}\|^2\big)$. (14) [sent-133, score-0.704]

65 Thus, $g(u_{ki})$ is a weighted combination of the values $\{\alpha_{jk} x^{(j)}_i\}$, where the weights depend exponentially on the squared $\ell_2$-distance of $u_{ki}$ to each of the statistics vectors $\{u^{(j)}_{ki}\}$. [sent-134, score-0.799]
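A sketch of evaluating the kernelized parameter function in (14), given weights α that are assumed to have been learned already; γ and all data shapes are illustrative:

```python
import numpy as np

def g_rbf(u, alpha, X, U, gamma=1.0):
    """Evaluate g(u) = sum_{j,k,i} alpha_jk * x_i^{(j)} * exp(-gamma ||u - u_ki^{(j)}||^2).

    u     : (q,) statistics vector at which to evaluate g
    alpha : (m, K) learned example weights
    X     : (m, n) word counts
    U     : (m, K, n, q) training statistics vectors
    """
    d2 = ((U - u) ** 2).sum(axis=-1)              # (m, K, n) squared distances
    w = np.exp(-gamma * d2)                       # RBF weights
    return np.einsum('jk,jn,jkn->', alpha, X, w)  # weighted sum over j, k, i

alpha = np.random.randn(6, 3)
X = np.random.rand(6, 10)
U = np.random.randn(6, 3, 10, 2)
print(g_rbf(np.zeros(2), alpha, X, U))
```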

66 4 Experiments. To validate our method, we evaluated its ability to learn parameter functions on a variety of email and webpage classification tasks in which the number of classes, K, was large (K = 10), and the number of training examples per class, m/K, was small (m/K = 2). [sent-136, score-0.241]

67 We used the dmoz Open Directory Project hierarchy, the 20 Newsgroups dataset, the Reuters-21578 dataset, and the Industry Sector dataset. [sent-137, score-0.138]

68 Table 1: Test set accuracy on dmoz categories. [sent-155, score-0.166]

69 Columns 2-4 give the proportion of correct classifications using non-discriminative methods: the learned g, Naive Bayes, and TFIDF, respectively. [sent-156, score-0.105]

70 Columns 5-7 give the corresponding values for the discriminative methods: softmax regression, 1-vs-all SVMs, and multiclass SVMs. [sent-157, score-0.303]

71 The dmoz project is a hierarchical collection of webpage links organized by subject matter. [sent-261, score-0.193]

72 Given a dataset of documents, we sampled 10-class, 2-training-examples-per-class classification problems by randomly selecting 10 different classes within the dataset, picking 2 training examples within each class, and choosing one test example from a randomly chosen class. [sent-265, score-0.227]
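A sketch of this sampling procedure; the corpus representation (a dict from class label to a list of documents, each class holding more than two documents) is an assumed format:

```python
import random

def sample_problem(corpus, n_classes=10, per_class=2):
    """Sample one n_classes-way, per_class-shot problem plus a single test example.

    corpus : dict mapping class label -> list of documents (assumed format).
    """
    classes = random.sample(sorted(corpus), n_classes)
    train = []
    for k in classes:
        train += [(d, k) for d in random.sample(corpus[k], per_class)]
    k_test = random.choice(classes)
    used = {id(d) for d, k in train if k == k_test}
    test_doc = random.choice([d for d in corpus[k_test] if id(d) not in used])
    return train, (test_doc, k_test)

# Toy usage with 12 classes of 5 one-token "documents" each.
corpus = {f"c{j}": [f"doc{j}_{t}" for t in range(5)] for j in range(12)}
train, (x_test, y_test) = sample_problem(corpus)
print(len(train), y_test)
```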

73 For simplicity, we reduced the feature vector in (3) to the following two-dimensional representation: $u_{ki} = \begin{bmatrix} \log(\text{proportion of } w_i \text{ among words from documents of class } k) \\ \log(\text{proportion of documents containing } w_i) \end{bmatrix}$. (15) [sent-268, score-0.917]

74 Note that, up to the log transformation, the components of $u_{ki}$ correspond to the relative term frequency and document frequency of a word relative to class k (see Figure 1). [sent-269, score-1.126]
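A sketch of computing the two components of (15) from tokenized training documents; the add-one smoothing is my own addition to keep the logarithms finite:

```python
import math

def stats_vector(word, k, train):
    """Compute the two components of u_ki in (15) for one (word, class) pair.

    train : list of (tokens, label) pairs, tokens a list of words (assumed format).
    """
    class_tokens = [t for toks, y in train if y == k for t in toks]
    tf = (class_tokens.count(word) + 1) / (len(class_tokens) + 1)         # word proportion in class k
    df = (sum(word in toks for toks, _ in train) + 1) / (len(train) + 1)  # document proportion
    return [math.log(tf), math.log(df)]

train = [(["cheap", "pills", "buy", "pills"], "spam"),
         (["meeting", "notes", "agenda"], "ham")]
print(stats_vector("pills", "spam", train))
```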

75 4.2 Generalization performance. We tested our meta-learning algorithm on classification problems taken from each of the 16 top-level dmoz categories. [sent-271, score-0.161]

76 [Table 2 header: Dataset, gdmoz, gnews, greut, gindu, gNB, gTFIDF, softmax, 1VA-SVM, MC-SVM; the data rows are not recoverable here.] [sent-277, score-0.438]

77 To assess the accuracy of our meta-learning algorithm for a particular test category, we used the g learned from a set of 450 classification problems drawn from the other 15 top-level categories. [sent-310, score-0.145]

78 In 15 out of 16 categories, the learned parameter function g outperforms naive Bayes and TFIDF in addition to the discriminative methods we tested (softmax regression, 1-vs-all SVMs [12], and multiclass SVMs [13]; see Table 1). [sent-312, score-0.47]

79 Here, for each of the four corpora (dmoz, 20 Newsgroups, Reuters-21578, Industry Sector), we constructed independent training and testing datasets of 480 random classification problems. [sent-314, score-0.151]

80 After training separate classifiers (gdmoz, gnews, greut, and gindu) using data from each of the four corpora, we tested the performance of each learned classifier on the remaining three corpora (see Table 2). [sent-315, score-0.331]

81 Again, the learned parameter functions compare favorably to the other methods. [sent-316, score-0.121]

82 Moreover, these tests show that a single parameter function may give an accurate classification algorithm for many different corpora, demonstrating the effectiveness of our approach for achieving transfer across related learning tasks. [sent-317, score-0.171]

83 5 Discussion and Related Work. In this paper, we presented an algorithm based on softmax regression for learning a parameter function g from example classification problems. [sent-318, score-0.271]

84 Another approach for learning g is to modify the multiclass support vector machine formulation of Crammer and Singer [13] in a manner analogous to the modification of softmax regression in Section 3.1, [sent-320, score-0.372]

85 giving the following quadratic program: $\min_{\beta \in \mathbb{R}^n,\, \xi \in \mathbb{R}^m} \; \frac{1}{2}\|\beta\|^2 + C \sum_j \xi_j \quad \text{subject to} \quad \beta^T \sum_i x^{(j)}_i \big(u^{(j)}_{y^{(j)}i} - u^{(j)}_{ki}\big) \geq I_{\{k \neq y^{(j)}\}} - \xi_j, \; \forall k, \forall j$. [sent-321, score-0.704]
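For reference, this program can be handed to a generic convex solver; a sketch using cvxpy on toy data, where the aggregated representation $Z[j,k] = \sum_i x^{(j)}_i u^{(j)}_{ki}$ and all values are assumptions:

```python
import cvxpy as cp
import numpy as np

# Toy problem: Z[j, k] = sum_i x_i^{(j)} u_ki^{(j)} (precomputed aggregated statistics).
rng = np.random.default_rng(0)
m, K, q, C = 20, 4, 2, 1.0
Z = rng.normal(size=(m, K, q))
y = rng.integers(0, K, size=m)

beta = cp.Variable(q)
xi = cp.Variable(m)
# One constraint per (j, k); the k = y^{(j)} case reduces to xi_j >= 0.
constraints = [(Z[j, y[j]] - Z[j, k]) @ beta >= float(k != y[j]) - xi[j]
               for j in range(m) for k in range(K)]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi)),
                     constraints)
problem.solve()
print(beta.value, problem.value)
```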

86 We implemented this method and found that the learned g, like in the softmax formulation, outperforms naive Bayes, TFIDF, and the other discriminative methods. [sent-323, score-0.456]

87 The techniques described in this paper give one approach for achieving inductive transfer in classifier design—using labeled data from related example classification problems to solve a particular classification problem [16, 17]. [sent-324, score-0.151]

88 Bennett et al. [18] also consider the issue of knowledge transfer in text classification in the context of ensemble classifiers, and propose a system for using related classification problems to learn the reliability of individual classifiers within the ensemble. [sent-326, score-0.303]

89 The same holdout set was used to select regularization parameters for the discriminative learning algorithms. [sent-328, score-0.153]

90 With more training data per class (m/K = 10), softmax and multiclass SVMs consistently outperform naive Bayes and TFIDF; nevertheless, the learned g achieves performance on par with discriminative methods, despite being constrained to parameters which are explicit functions of training data statistics. [sent-332, score-0.636]

91 This result is consistent with a previous study in which a heuristically hand-tuned version of Naive Bayes attained near-SVM text classification performance for large datasets [6]. [sent-333, score-0.2]

92 Though not directly applied to text classification, Teevan and Karger [19] consider the problem of automatically learning term distributions for use in information retrieval. [sent-335, score-0.256]

93 A similar concept is implicit in the kernelized parameter function learned by our algorithm, where the Gaussian kernel facilitates transfer between similar statistics vectors. [sent-338, score-0.287]

94 A comparison of event models for Naive Bayes text classification. [sent-355, score-0.173]

95 A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. [sent-364, score-0.197]

96 Tackling the poor assumptions of naive Bayes text classifiers. [sent-373, score-0.337]

97 generative classifiers: a comparison of logistic regression and naive Bayes. [sent-388, score-0.202]

98 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-417, score-0.133]

99 Inductive transfer for text classification using generalized reliability indicators. [sent-451, score-0.28]

100 Empirical development of an exponential probabilistic model for text retrieval: Using textual analysis to build a better model. [sent-457, score-0.197]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('uki', 0.631), ('tfidf', 0.308), ('classi', 0.216), ('ki', 0.212), ('xtest', 0.197), ('text', 0.173), ('naive', 0.164), ('softmax', 0.144), ('dmoz', 0.138), ('cation', 0.127), ('documents', 0.117), ('bayes', 0.114), ('multiclass', 0.1), ('document', 0.088), ('ytest', 0.086), ('jk', 0.084), ('transfer', 0.082), ('training', 0.082), ('gnb', 0.079), ('gtfidf', 0.079), ('sector', 0.079), ('wi', 0.076), ('xi', 0.073), ('fk', 0.071), ('corpora', 0.069), ('learned', 0.063), ('industry', 0.063), ('class', 0.06), ('teevan', 0.059), ('discriminative', 0.059), ('newsgroups', 0.058), ('parameter', 0.058), ('statistics', 0.054), ('er', 0.05), ('ers', 0.049), ('svms', 0.042), ('proportion', 0.042), ('vectors', 0.041), ('log', 0.041), ('exp', 0.041), ('gdmoz', 0.039), ('gindu', 0.039), ('gnews', 0.039), ('greut', 0.039), ('holdout', 0.039), ('libsvm', 0.039), ('prtfidf', 0.039), ('uy', 0.039), ('category', 0.039), ('word', 0.039), ('regression', 0.038), ('categories', 0.038), ('tasks', 0.038), ('tc', 0.034), ('rq', 0.034), ('examples', 0.034), ('stanford', 0.034), ('kn', 0.034), ('sampled', 0.033), ('vector', 0.033), ('learning', 0.031), ('test', 0.031), ('maxk', 0.031), ('score', 0.031), ('kernel', 0.03), ('automatically', 0.03), ('bennett', 0.029), ('unnormalized', 0.029), ('webpage', 0.029), ('accuracy', 0.028), ('mapping', 0.028), ('nb', 0.027), ('heuristically', 0.027), ('assigning', 0.027), ('variants', 0.026), ('outperforms', 0.026), ('collection', 0.026), ('discarded', 0.026), ('sigir', 0.026), ('support', 0.026), ('frequency', 0.026), ('reliability', 0.025), ('vocabulary', 0.025), ('product', 0.025), ('nes', 0.025), ('optimization', 0.024), ('parameters', 0.024), ('picking', 0.024), ('crammer', 0.024), ('probabilistic', 0.024), ('uk', 0.023), ('inductive', 0.023), ('regularities', 0.023), ('categorization', 0.023), ('labeled', 0.023), ('numerical', 0.023), ('http', 0.023), ('problems', 0.023), ('inner', 0.023), ('term', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 195 nips-2005-Transfer learning for text classification

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form function of training set statistics; we call this mapping from statistics to parameters the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks.

2 0.15491892 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising.

3 0.1261327 105 nips-2005-Large-Scale Multiclass Transduction

Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan

Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1

4 0.1144027 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

5 0.096336454 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

Author: Sathiya Keerthi, Wei Chu

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1

6 0.092525735 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

7 0.090719022 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

8 0.088631153 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

9 0.075551949 50 nips-2005-Convex Neural Networks

10 0.075276896 48 nips-2005-Context as Filtering

11 0.073191382 185 nips-2005-Subsequence Kernels for Relation Extraction

12 0.071337245 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

13 0.069073968 78 nips-2005-From Weighted Classification to Policy Search

14 0.06897369 161 nips-2005-Radial Basis Function Network for Multi-task Learning

15 0.068497971 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

16 0.06675265 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

17 0.065044805 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

18 0.063399024 52 nips-2005-Correlated Topic Models

19 0.062455226 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

20 0.061782233 184 nips-2005-Structured Prediction via the Extragradient Method


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, 0.092), (2, -0.04), (3, -0.014), (4, 0.086), (5, 0.014), (6, 0.16), (7, 0.055), (8, 0.041), (9, -0.015), (10, -0.06), (11, -0.167), (12, 0.073), (13, -0.056), (14, -0.018), (15, -0.015), (16, 0.022), (17, -0.022), (18, -0.001), (19, 0.04), (20, -0.061), (21, -0.026), (22, -0.038), (23, -0.03), (24, -0.095), (25, 0.014), (26, -0.055), (27, 0.039), (28, 0.081), (29, 0.046), (30, 0.077), (31, 0.071), (32, 0.097), (33, -0.109), (34, 0.028), (35, -0.053), (36, -0.133), (37, -0.052), (38, 0.061), (39, -0.09), (40, 0.075), (41, 0.028), (42, -0.014), (43, -0.154), (44, -0.008), (45, 0.021), (46, -0.096), (47, -0.01), (48, -0.079), (49, 0.153)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93011189 195 nips-2005-Transfer learning for text classification

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form function of training set statistics; we call this mapping from statistics to parameters the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks.

2 0.65658259 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising.

3 0.56446779 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

Author: Mikaela Keller, Samy Bengio, Siew Y. Wong

Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions. 1

4 0.5613513 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

5 0.54867458 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

Author: François Laviolette, Mario Marchand, Mohak Shah

Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1

6 0.53814757 105 nips-2005-Large-Scale Multiclass Transduction

7 0.49884903 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

8 0.46655145 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

9 0.45151058 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

10 0.44220105 185 nips-2005-Subsequence Kernels for Relation Extraction

11 0.41632575 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

12 0.41406074 114 nips-2005-Learning Rankings via Convex Hull Separation

13 0.41194004 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

14 0.41100609 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

15 0.4099035 131 nips-2005-Multiple Instance Boosting for Object Detection

16 0.4054769 50 nips-2005-Convex Neural Networks

17 0.4045535 151 nips-2005-Pattern Recognition from One Example by Chopping

18 0.40360618 171 nips-2005-Searching for Character Models

19 0.38814232 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

20 0.38083866 161 nips-2005-Radial Basis Function Network for Multi-task Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.051), (5, 0.01), (10, 0.033), (25, 0.285), (27, 0.035), (31, 0.043), (34, 0.125), (41, 0.011), (50, 0.012), (55, 0.027), (69, 0.051), (73, 0.039), (88, 0.142), (91, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93160808 4 nips-2005-A Bayesian Spatial Scan Statistic

Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper

Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low. 1

2 0.84678501 169 nips-2005-Saliency Based on Information Maximization

Author: Neil Bruce, John Tsotsos

Abstract: A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Resu lts on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts.

same-paper 3 0.75783432 195 nips-2005-Transfer learning for text classification

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form function of training set statistics; we call this mapping from statistics to parameters the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks.

4 0.59528875 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard

Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1

5 0.59128737 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

6 0.58654249 184 nips-2005-Structured Prediction via the Extragradient Method

7 0.58479589 50 nips-2005-Convex Neural Networks

8 0.58443791 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

9 0.58311254 23 nips-2005-An Application of Markov Random Fields to Range Sensing

10 0.58129078 114 nips-2005-Learning Rankings via Convex Hull Separation

11 0.5799157 45 nips-2005-Conditional Visual Tracking in Kernel Space

12 0.57991481 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

13 0.57839203 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

14 0.57820743 105 nips-2005-Large-Scale Multiclass Transduction

15 0.57739508 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

16 0.57725137 144 nips-2005-Off-policy Learning with Options and Recognizers

17 0.57700527 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

18 0.57686585 151 nips-2005-Pattern Recognition from One Example by Chopping

19 0.57607406 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

20 0.57597613 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs