nips nips2008 nips2008-242 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper investigates a new machine learning strategy called translated learning. [sent-6, score-0.227]
2 Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. [sent-7, score-0.292]
3 For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. [sent-8, score-0.469]
4 An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. [sent-9, score-0.679]
5 The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. [sent-10, score-0.634]
6 Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. [sent-11, score-0.147]
7 Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. [sent-13, score-0.377]
8 1 Introduction Traditional machine learning relies on the availability of a large amount of labeled data to train a model in the same feature space. [sent-14, score-0.24]
9 However, labeled data are often scarce and expensive to obtain. [sent-15, score-0.156]
10 To save labeling effort, various machine learning strategies have been proposed, including semi-supervised learning [13], transfer learning [3, 11, 10], self-taught learning [9], etc. [sent-16, score-0.191]
11 One commonality among these strategies is that they all require the training data and test data to be in the same feature space. [sent-17, score-0.161]
12 However, in practice, we often face the problem where labeled data are scarce in their own feature space, whereas there are sufficient labeled data in other feature spaces. [sent-19, score-0.424]
13 For example, there may be few labeled images available, but there are often plenty of labeled text documents on the Web (e.g., in the Open Directory Project, ODP). [sent-20, score-0.445]
14 Another example is cross-language classification, where labeled documents in English are far more plentiful than those in some other languages, such as Bangla, which has only 21 Web pages in the ODP. [sent-25, score-0.201]
15 Therefore, it would be valuable if we could transfer knowledge across different feature spaces and thereby save substantial labeling effort. [sent-26, score-0.142]
16 To address the transfer of knowledge across different feature spaces, researchers have proposed multi-view learning [2, 8, 7], in which each instance has multiple views in different feature spaces. [sent-27, score-0.183]
17 Different from multi-view learning, in this paper we focus on the situation where the training data are in a source feature space, the test data are in a different target feature space, and there is no correspondence between instances in these spaces. [sent-28, score-0.397]
18 The source and target feature spaces can be totally different. [Figure 1 residue: panels (a) Supervised Learning, (b) Semi-supervised Learning, (c) Transfer Learning, (d) Self-taught Learning; example text inside the figure: "Elephants are large and gray."] [sent-29, score-0.277]
19 Figure 1: An intuitive illustration of different kinds of learning strategies, using the classification of elephant and rhino images as the example. [sent-41, score-0.148]
20 The images in orange frames are labeled data, while the ones without frames are unlabeled data. [sent-42, score-0.201]
21 To solve this new learning problem, we develop a novel framework named translated learning, in which the training data and test data can be in totally different feature spaces. [sent-44, score-0.42]
22 A translator needs to be exploited to link the different feature spaces. [sent-45, score-0.309]
23 Clearly, the translated learning framework is more general and more difficult than traditional learning problems. [sent-46, score-0.272]
24 Figure 1 presents an intuitive illustration of six different learning strategies, including supervised learning, semi-supervised learning [13], transfer learning [10], self-taught learning [9], multi-view learning [2], and finally, translated learning. [sent-47, score-0.378]
25 An intuitive idea for translated learning is to somehow translate all the training data into the target feature space, where learning can be done within a single feature space. [sent-48, score-0.538]
26 However, for the more general translated learning problem, this idea is hard to realize, since machine translation between different feature spaces is very difficult to accomplish in many non-natural-language cases, such as translating documents to images. [sent-50, score-0.5]
27 Furthermore, while a text corpus can be exploited for cross-language translation, for translated learning the learning of the "feature-space translator" from available resources is a key issue. [sent-51, score-0.296]
28 Our solution is to make the best use of available data that have features in both the source and target domains in order to construct a translator. [sent-52, score-0.175]
29 While these data may not be sufficient for building a good classifier for the target domain, as we will demonstrate in our experimental study, by leveraging the available labeled data in the source domain we can indeed build effective translators. [sent-53, score-0.345]
30 An example is to translate between the text and image feature spaces using the social tagging data from Web sites such as Flickr (http://www. [sent-54, score-0.317]
31 The main contribution of our work is to combine feature translation and nearest-neighbor learning into a unified model by making use of a language model [5]. [sent-57, score-0.148]
32 In translated learning, the training data xs are represented by the features ys in the source feature space, while the test data xt are represented by the features yt in the target feature space. [sent-59, score-1.921]
33 We model the learning in the source space through a Markov chain c → ys → xs , which can be connected to another Markov chain c → yt → xt in the target space. [sent-60, score-1.624]
34 An important contribution of our work, then, is to show how to connect these two paths, so that the new chain c → ys → yt → xt can be used to translate the knowledge from the source space to the target one, where the mapping ys → yt acts as a feature-level translator. [sent-61, score-2.178]
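To make the chain composition concrete, here is a minimal sketch (our own illustration, not code from the paper; the array sizes and names are assumptions) that composes the class-to-source-feature model p(ys | c) with a translator p(yt | ys), both as row-stochastic matrices, carrying the source-space class model into the target feature space:

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper).
n_classes, n_src_feats, n_tgt_feats = 2, 1000, 500

rng = np.random.default_rng(0)

def rows_to_distributions(m):
    """Normalize each row so it sums to 1 (one conditional distribution per row)."""
    return m / m.sum(axis=1, keepdims=True)

# p(ys | c): class-to-source-feature language model, shape (n_classes, n_src_feats).
p_ys_given_c = rows_to_distributions(rng.random((n_classes, n_src_feats)))

# p(yt | ys): feature-level translator, shape (n_src_feats, n_tgt_feats).
p_yt_given_ys = rows_to_distributions(rng.random((n_src_feats, n_tgt_feats)))

# Composing the chain c -> ys -> yt marginalizes out ys:
#   p(yt | c) = sum_ys p(yt | ys) p(ys | c)
p_yt_given_c = p_ys_given_c @ p_yt_given_ys        # shape (n_classes, n_tgt_feats)

assert np.allclose(p_yt_given_c.sum(axis=1), 1.0)  # still a distribution per class
```

Each arrow in the chain corresponds to marginalizing out the intermediate variable, which is why the composition reduces to a matrix product.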
35 In our final solution, which we call TLRisk, we exploit the risk minimization framework in [5] to model translated learning. [sent-62, score-0.325]
36 1 Translated Learning Framework Problem Formulation We first define the translated learning problem formally. [sent-65, score-0.227]
37 In this space, each instance xs ∈ Xs is represented by a feature vector (ys^(1), …, ys^(ns)). [sent-67, score-0.251]
38 Here ys^(i) ∈ Ys, and Ys is the source feature space. [sent-70, score-0.885]
39 Let Xt be the target instance space, in which each instance xt ∈ Xt is represented by a feature vector (yt^(1), …, yt^(nt)). [sent-71, score-0.64]
40 Here yt^(i) ∈ Yt, and Yt is the target feature space. [sent-74, score-0.898]
41 We have a labeled training data set Ls = {(xs^(i), cs^(i))}_{i=1}^{n} in the source space, where xs^(i) ∈ Xs and cs^(i) ∈ C = {1, …, |C|}. [sent-75, score-0.392]
42 We also have another labeled training data set Lt = {(xt^(i), ct^(i))}_{i=1}^{m} in the target space, where xt^(i) ∈ Xt and ct^(i) ∈ C. [sent-79, score-0.683]
43 Note that xs^(i) is in a different feature space from xt^(i) and xu^(i). [sent-82, score-0.727]
44 For example, xs may be a text document, while xt and xu may be visual images. [sent-83, score-0.703]
45 To link the two feature spaces, a feature translator p(yt |ys ) ∝ φ(yt , ys ) is constructed. [sent-84, score-0.748]
46 However, for ease of explanation, we first assume that the translator φ is given, and will discuss the derivation of φ later in this section, based on co-occurrence data. [sent-85, score-0.208]
47 We focus on our main objective in learning, which is to estimate a hypothesis ht : Xt → C to classify the instances xu^(i) ∈ U as accurately as possible, by making use of the labeled training data L = Ls ∪ Lt and the translator φ. [sent-86, score-0.452]
48 2 Risk Minimization Framework First, we formulate our objective in terms of how to minimize an expected risk function with respect to the labeled training data L = Ls ∪ Lt and the translator φ by extending the risk minimization framework in [5]. [sent-88, score-0.56]
49 In this work, we use the risk function R(c, xt) to measure the risk of classifying xt into the category c. [sent-89, score-1.097]
50 Therefore, to predict the label for an instance xt, we need only find the class label c that minimizes the risk function R(c, xt), so that ht(xt) = arg min_{c∈C} R(c, xt). (1) [sent-90, score-1.475]
51 The risk function R(c, xt) can be formulated as the expected loss when c and xt are relevant; formally, R(c, xt) ≡ L(r = 1 | c, xt) = ∫_{ΘC} ∫_{ΘXt} L(θC, θXt, r = 1) p(θC | c) p(θXt | xt) dθXt dθC. (2) [sent-91, score-1.875]
52 Here, r = 1 represents the event "relevant", which means (in Equation (2)) that "c and xt are relevant", or that "the label of xt is c". [sent-92, score-0.9]
53 θC and θXt are the models with respect to classes C and target space instances Xt respectively. [sent-93, score-0.137]
54 Note that, in Equation (2), θC depends only on c and θXt depends only on xt. [sent-95, score-0.45]
55 Thus, we use p(θC |c) to replace p(θC |c, xt ), and use p(θXt |xt ) to replace p(θXt |c, xt ). [sent-96, score-0.9]
56 Replacing L(θC, θXt, r = 1) with ∆(θC, θXt), the risk function is reformulated as R(c, xt) ∝ ∫_{ΘC} ∫_{ΘXt} ∆(θC, θXt) p(θC | c) p(θXt | xt) dθXt dθC. [sent-105, score-0.525]
57 In this paper, we approximate the risk function by its value at the posterior mode: R(c, xt) ≈ ∆(θ̂c, θ̂xt) p(θ̂c | c) p(θ̂xt | xt) ∝ ∆(θ̂c, θ̂xt) p(θ̂c | c), (4) where θ̂c = arg max_{θC} p(θC | c) and θ̂xt = arg max_{θXt} p(θXt | xt). [sent-107, score-0.525]
58 Output: the predicted label ht(xt) for each xt ∈ U. [sent-111, score-0.478]
59 (Algorithm 1, TLRisk, test procedure) For each xt ∈ U, estimate the model θ̂xt based on Equation (7). [sent-113, score-0.45]
60 Then predict the label ht(xt) for xt based on Equations (1) and (5). [sent-114, score-0.478]
61 R(c, xt) ∝ ∆(θ̂c, θ̂xt), (5) where ∆(θ̂c, θ̂xt) denotes the dissimilarity between the two models θ̂c and θ̂xt. [sent-115, score-0.493]
62 To achieve this objective, as in [5], we formulate these two models in the target feature space Yt; specifically, if we use KL divergence as the distance function, ∆(θ̂c, θ̂xt) can be measured by KL(p(Yt | θ̂c) || p(Yt | θ̂xt)). [sent-116, score-0.17]
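As a concrete reading of Equations (1), (4) and (5) with KL divergence as the dissimilarity ∆, the following sketch (our own illustration; the smoothing constant and the toy numbers are assumptions, and smoothed empirical distributions stand in for the estimated models θ̂c and θ̂xt) scores each class model p(Yt | θ̂c) against the instance model p(Yt | θ̂xt) and predicts the class with the smallest risk:

```python
import numpy as np

def smooth(p, eps=1e-8):
    """Add a small constant and renormalize, so the KL divergence stays finite."""
    p = np.asarray(p, dtype=float) + eps
    return p / p.sum()

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the target feature vocabulary Yt."""
    return float(np.sum(p * np.log(p / q)))

def predict(class_models, instance_model):
    """ht(xt) = arg min_c Delta(theta_c, theta_xt), with Delta taken as KL divergence.

    class_models  : dict mapping class label -> p(Yt | theta_c)
    instance_model: p(Yt | theta_xt) for one test instance xt
    """
    q = smooth(instance_model)
    risks = {c: kl_divergence(smooth(p_c), q) for c, p_c in class_models.items()}
    return min(risks, key=risks.get), risks

# Toy example over a 4-word target vocabulary (numbers are made up).
class_models = {"horse": [0.5, 0.3, 0.1, 0.1], "coin": [0.1, 0.1, 0.4, 0.4]}
xt_model = [0.45, 0.35, 0.1, 0.1]
label, risks = predict(class_models, xt_model)
print(label, risks)  # picks "horse", the class whose model is closest in KL
```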
63 Integrating Equations (1), (5), (6) and (7), our translated learning framework is summarized as algorithm TLRisk, an abbreviation for Translated Learning via Risk Minimization, which is shown in Algorithm 1. [sent-119, score-0.249]
64 4 Translator φ We now explain in particular how to build the translator φ(yt , ys ) ∝ p(yt |ys ) to connect two different feature spaces. [sent-125, score-0.666]
65 As mentioned before, to estimate the translator p(yt | ys), we need some co-occurrence data across the two feature spaces: source and target. [sent-126, score-0.398]
66 Formally, we need co-occurrence data in the form of p(yt , ys ), p(yt , xs ), p(xt , ys ), or p(xt , xs ). [sent-127, score-1.082]
67 In cross-language problems, dictionaries can be considered as data in the form of p(yt , ys ) (feature-level co-occurrence). [sent-128, score-0.392]
68 Here, horse vs. coin indicates that all the positive instances are about horses, while all the negative instances are about coins. [sent-130, score-0.285]
69 Social tagging data (e.g., Flickr, where images are associated with keywords) and search-engine results in response to queries are examples of correlational data in the forms of p(yt, xs) and p(xt, ys) (feature-instance co-occurrence). [sent-134, score-0.619]
70 Multimedia data (e.g., Web pages including both text and pictures) is an example of data in the form of p(xt, xs) (instance-level co-occurrence). [sent-137, score-0.251]
71 When a pool of such co-occurrence data is available, we can build the translator φ for estimating the Markov chains described in the previous subsections. [sent-138, score-0.249]
72 The instance-level co-occurrence data can also be converted to feature-level co-occurrence; formally, p(yt, ys) = ∫_{Xt} ∫_{Xs} p(xt, xs) p(ys | xs) p(yt | xt) dxs dxt. [sent-140, score-0.612]
73 Using the feature-level co-occurrence probability p(yt, ys), we can estimate the translator as p(yt | ys) = p(yt, ys) / ∫_{Yt} p(yt, ys) dyt. [sent-142, score-1.318]
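In the discrete case, both steps reduce to simple matrix operations. The sketch below is a minimal illustration under our own assumptions about array shapes (it is not the authors' implementation): it converts instance-level co-occurrence p(xt, xs) into feature-level co-occurrence p(yt, ys), then normalizes over yt to obtain the translator p(yt | ys):

```python
import numpy as np

def feature_level_cooccurrence(p_xt_xs, p_ys_given_xs, p_yt_given_xt):
    """Discrete version of p(yt, ys) = sum_xt sum_xs p(xt, xs) p(ys | xs) p(yt | xt).

    p_xt_xs       : (n_xt, n_xs) joint weights over co-occurring instance pairs
    p_ys_given_xs : (n_xs, n_ys) per-source-instance feature distributions
    p_yt_given_xt : (n_xt, n_yt) per-target-instance feature distributions
    returns       : (n_yt, n_ys) feature-level co-occurrence p(yt, ys)
    """
    p_xt_xs = p_xt_xs / p_xt_xs.sum()                  # make it a joint distribution
    return p_yt_given_xt.T @ p_xt_xs @ p_ys_given_xs   # marginalize out xt and xs

def translator(p_yt_ys):
    """p(yt | ys) = p(yt, ys) / sum_yt p(yt, ys): normalize each ys column over yt."""
    return p_yt_ys / p_yt_ys.sum(axis=0, keepdims=True)

# Tiny made-up example: 2 target instances, 2 source instances, 3 target features, 4 source features.
rng = np.random.default_rng(1)
p_xt_xs = rng.random((2, 2))
p_ys_given_xs = rng.random((2, 4)); p_ys_given_xs /= p_ys_given_xs.sum(axis=1, keepdims=True)
p_yt_given_xt = rng.random((2, 3)); p_yt_given_xt /= p_yt_given_xt.sum(axis=1, keepdims=True)

phi = translator(feature_level_cooccurrence(p_xt_xs, p_ys_given_xs, p_yt_given_xt))
assert np.allclose(phi.sum(axis=0), 1.0)               # each column is a valid p(yt | ys)
```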
74 3 Evaluation: Text-aided Image Classification In this section, we apply our framework TLRisk to a text-aided image classification problem, which uses binary labeled text documents as auxiliary data to enhance the image classification. [sent-143, score-0.512]
75 This problem is derived from the application where a user or a group of users may have expressed preferences over some text documents, and we wish to translate these preferences to images for the same group of users. [sent-144, score-0.161]
76 org/) were used in our evaluation, as the image and text corpora. [sent-150, score-0.146]
77 horse vs. coin indicates that all the positive instances are about horses, while all the negative instances are about coins. [sent-156, score-0.285]
78 Based on the code-book, each image can be converted to a corresponding feature vector. [sent-162, score-0.146]
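A generic bag-of-visual-words pipeline for this step might look like the sketch below; the use of scikit-learn's KMeans, the codebook size, and the 128-dimensional descriptors are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, n_words=200, seed=0):
    """Cluster all local descriptors (e.g. SIFT vectors) into a visual-word codebook."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_desc)

def image_to_histogram(descriptors, codebook):
    """Quantize one image's descriptors and return a normalized visual-word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage with random arrays standing in for real local descriptors.
rng = np.random.default_rng(0)
descriptor_sets = [rng.random((50, 128)) for _ in range(10)]   # 10 images, 50 descriptors each
codebook = build_codebook(descriptor_sets, n_words=16)
feature_vectors = np.array([image_to_histogram(d, codebook) for d in descriptor_sets])
```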
79 The collected data are in the form of feature-instance co-occurrence p(ys , xt ), so that we have to convert them to feature-level co-occurrence p(ys , yt ) as discussed in Section 2. [sent-165, score-0.848]
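One simple discrete reading of this conversion (our own assumption; the array layout is illustrative) marginalizes out the collected images xt using their visual-word distributions:

```python
import numpy as np

def feature_level_from_feature_instance(p_ys_xt, p_yt_given_xt):
    """p(ys, yt) = sum_xt p(ys, xt) p(yt | xt): marginalize out the target instances xt.

    p_ys_xt       : (n_ys, n_xt) feature-instance co-occurrence (e.g. text keyword vs. collected image)
    p_yt_given_xt : (n_xt, n_yt) per-image distribution over visual words
    returns       : (n_ys, n_yt) feature-level co-occurrence p(ys, yt)
    """
    return p_ys_xt @ p_yt_given_xt
```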
80 15 12 4 8 16 32 number of labeled images per category Image Only Search+Image TLRisk Lowerbound 0. [sent-176, score-0.204]
81 15 12 4 8 16 32 number of labeled images per category (a) 12 4 8 16 32 number of labeled images per category (b) (c) Figure 2: The average error rates over 12 data sets for text-aided image classification with different number of labeled images Lt . [sent-188, score-0.682]
82 The second baseline is to use the category name (in this case, there are two names for binary classification problems) to search for training images and then to train classifiers together with labeled images in Lt ; we refer to this model as Search+Image. [sent-213, score-0.362]
83 Note that this strategy, which is referred to as Lowerbound, is unavailable in our problem setting, since it uses a large amount of labeled data in the target space. [sent-217, score-0.228]
84 This indicates that our framework TLRisk can effectively learn knowledge across different feature spaces in the case of text-to-image classification. [sent-225, score-0.146]
85 In this experiment, we fixed the number of target training images per category to one, and set the threshold K (which is the number of images to collect for each text keyword, when collecting the co-occurrence data) to 40. [sent-234, score-0.335]
86 From the figure, we can see that, on one hand, when λ is very large, which means the classification model is built mainly on the target-space training images Lt, the performance is rather poor. [sent-235, score-0.194]
87 On the other hand, when λ is small, so that the classification model relies more on the auxiliary text training data Ls, the classification performance is relatively stable. [sent-236, score-0.138]
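The excerpt does not spell out the exact form of λ, but a natural reading, shown in the hedged sketch below (an assumption on our part, not the paper's stated formula), is a λ-weighted interpolation between the class model estimated from the few target-space examples in Lt and the class model translated from the source-space data Ls:

```python
import numpy as np

def interpolated_class_model(p_yt_from_lt, p_yt_from_ls, lam):
    """p(Yt | c) as a lambda-weighted mixture of the target-side and translated source-side models.

    lam close to 1 trusts the few labeled target images (Lt);
    lam close to 0 trusts the translated auxiliary text data (Ls).
    """
    assert 0.0 <= lam <= 1.0
    mixture = lam * np.asarray(p_yt_from_lt) + (1.0 - lam) * np.asarray(p_yt_from_ls)
    return mixture / mixture.sum()
```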
88 We focused on English-to-German classification, where English documents are used as the source data to help classify German documents, which are target data. [sent-239, score-0.268]
89 The dictionary data are in the form of feature-level co-occurrence p(yt , ys ). [sent-242, score-0.422]
90 We note that while most cross-language classification methods rely on machine translation [1], our assumption is that machine translation is unavailable and we rely on a dictionary only. [sent-243, score-0.127]
91 Our framework TLRisk was compared to classification using only a few labeled German documents as a baseline, called German Labels Only. [sent-245, score-0.223]
92 In this experiment, we have only sixteen labeled German documents in each category. [sent-248, score-0.201]
93 Related transfer learning settings include multi-task learning [3], learning with auxiliary data sources [11], learning from irrelevant categories [10], and self-taught learning [9, 4]. [sent-296, score-0.135]
94 The translated learning proposed in this paper can be considered an instance of general transfer learning; that is, transfer learning from data in different feature spaces. [sent-297, score-0.481]
95 However, as discussed before, multi-view learning requires that each instance contain two views, while in translated learning this requirement is relaxed. [sent-302, score-0.249]
96 6 Conclusions In this paper, we proposed a translated learning framework for classifying target data using data from another feature space. [sent-304, score-0.457]
97 We have shown that in translated learning, even though we have very little labeled data in the target space, if we can find a bridge to link the two spaces through feature translation, we can achieve good performance by leveraging the knowledge from the source data. [sent-305, score-0.664]
98 We formally formulated our translated learning framework using risk minimization, and presented an approximation method for model estimation. [sent-306, score-0.324]
99 The experimental results on text-aided image classification and cross-language classification show that our algorithm can greatly outperform the state-of-the-art baseline methods. [sent-308, score-0.128]
100 A comparative study on feature selection in text categorization. [sent-378, score-0.138]
wordName wordTfidf (topN-words)
[('xt', 0.45), ('tlrisk', 0.442), ('yt', 0.376), ('ys', 0.37), ('translator', 0.208), ('translated', 0.204), ('xs', 0.16), ('lowerbound', 0.13), ('labeled', 0.108), ('deutsch', 0.104), ('odp', 0.104), ('documents', 0.093), ('lt', 0.093), ('vs', 0.087), ('target', 0.077), ('image', 0.077), ('source', 0.076), ('risk', 0.075), ('text', 0.069), ('feature', 0.069), ('images', 0.067), ('classi', 0.063), ('transfer', 0.059), ('dyt', 0.057), ('german', 0.057), ('spaces', 0.055), ('ize', 0.052), ('horse', 0.049), ('top', 0.048), ('world', 0.043), ('dissimilarity', 0.043), ('web', 0.043), ('ls', 0.042), ('dai', 0.039), ('translation', 0.038), ('instances', 0.036), ('sport', 0.034), ('dxt', 0.034), ('xue', 0.034), ('chain', 0.034), ('link', 0.032), ('named', 0.032), ('cation', 0.031), ('cosine', 0.031), ('directory', 0.031), ('dictionary', 0.03), ('category', 0.029), ('pearson', 0.029), ('ht', 0.028), ('coin', 0.028), ('baseline', 0.027), ('kong', 0.026), ('unlabeled', 0.026), ('ballsport', 0.026), ('coefficient', 0.026), ('dxs', 0.026), ('elephants', 0.026), ('gesellschaft', 0.026), ('ncos', 0.026), ('ocation', 0.026), ('scarce', 0.026), ('skating', 0.026), ('wenyuan', 0.026), ('training', 0.026), ('translate', 0.025), ('hong', 0.025), ('english', 0.025), ('space', 0.024), ('xu', 0.024), ('greatly', 0.024), ('minimization', 0.024), ('health', 0.023), ('enhance', 0.023), ('internet', 0.023), ('yang', 0.023), ('learning', 0.023), ('qiang', 0.023), ('flickr', 0.023), ('cooccurrence', 0.023), ('data', 0.022), ('strategies', 0.022), ('framework', 0.022), ('instance', 0.022), ('leveraging', 0.021), ('accept', 0.021), ('unavailable', 0.021), ('supervisory', 0.021), ('kullback', 0.021), ('leibler', 0.021), ('auxiliary', 0.021), ('icml', 0.02), ('search', 0.02), ('nigam', 0.019), ('build', 0.019), ('shanghai', 0.018), ('train', 0.018), ('classifying', 0.018), ('markov', 0.018), ('language', 0.018), ('save', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
2 0.37459287 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
3 0.23523208 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
4 0.20662716 206 nips-2008-Sequential effects: Superstition or rational behavior?
Author: Angela J. Yu, Jonathan D. Cohen
Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1
5 0.20395653 57 nips-2008-Deflation Methods for Sparse PCA
Author: Lester W. Mackey
Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1
6 0.17530698 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
7 0.13263108 15 nips-2008-Adaptive Martingale Boosting
8 0.1168134 195 nips-2008-Regularized Policy Iteration
9 0.1079071 119 nips-2008-Learning a discriminative hidden part model for human action recognition
10 0.10274141 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
11 0.097906455 168 nips-2008-Online Metric Learning and Fast Similarity Search
12 0.094448008 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
13 0.090622231 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
14 0.087489888 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
15 0.080113791 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
16 0.071546413 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
17 0.069665119 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
18 0.066490583 113 nips-2008-Kernelized Sorting
19 0.066095658 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
20 0.065518849 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
topicId topicWeight
[(0, -0.191), (1, 0.009), (2, -0.053), (3, -0.134), (4, -0.15), (5, 0.387), (6, -0.01), (7, 0.188), (8, 0.216), (9, -0.176), (10, -0.033), (11, -0.238), (12, -0.04), (13, -0.127), (14, -0.027), (15, -0.008), (16, 0.023), (17, -0.058), (18, 0.065), (19, -0.053), (20, 0.06), (21, -0.065), (22, -0.118), (23, 0.008), (24, 0.091), (25, 0.058), (26, -0.035), (27, -0.04), (28, -0.043), (29, -0.036), (30, 0.024), (31, 0.061), (32, -0.075), (33, -0.016), (34, 0.1), (35, 0.042), (36, -0.025), (37, -0.034), (38, -0.032), (39, -0.032), (40, 0.045), (41, 0.042), (42, 0.012), (43, 0.048), (44, -0.038), (45, 0.062), (46, -0.006), (47, -0.052), (48, 0.027), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.96899122 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
2 0.82652301 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
3 0.78651571 57 nips-2008-Deflation Methods for Sparse PCA
Author: Lester W. Mackey
Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1
4 0.66797143 206 nips-2008-Sequential effects: Superstition or rational behavior?
Author: Angela J. Yu, Jonathan D. Cohen
Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1
5 0.63846248 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
6 0.48858729 15 nips-2008-Adaptive Martingale Boosting
7 0.47672203 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
8 0.46410486 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
9 0.43744987 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
10 0.40318567 119 nips-2008-Learning a discriminative hidden part model for human action recognition
11 0.34607533 195 nips-2008-Regularized Policy Iteration
12 0.33387253 168 nips-2008-Online Metric Learning and Fast Similarity Search
13 0.29472727 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
14 0.27250913 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
15 0.27203813 113 nips-2008-Kernelized Sorting
16 0.26501378 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
17 0.25317714 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
18 0.25301093 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
19 0.25173372 244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation
20 0.25164244 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
topicId topicWeight
[(6, 0.043), (7, 0.057), (12, 0.039), (15, 0.012), (28, 0.177), (38, 0.294), (57, 0.056), (59, 0.019), (63, 0.016), (71, 0.02), (77, 0.053), (78, 0.011), (83, 0.092)]
simIndex simValue paperId paperTitle
1 0.76806355 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
2 0.76593316 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
same-paper 3 0.76146454 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
4 0.68370342 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
5 0.60194981 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
6 0.60038823 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
7 0.60004789 194 nips-2008-Regularized Learning with Networks of Features
8 0.59960991 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
9 0.59959215 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
10 0.59958875 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
11 0.59693313 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
12 0.59603196 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
14 0.59551466 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
15 0.59533381 113 nips-2008-Kernelized Sorting
16 0.5949567 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
17 0.59454769 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
18 0.59445089 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
19 0.59431338 231 nips-2008-Temporal Dynamics of Cognitive Control
20 0.59367526 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries