nips nips2009 nips2009-130 knowledge-graph by maker-knowledge-mining

130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization


Source: pdf

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We assume the existence of view generating functions which may complete the missing views in an approximate way. [sent-11, score-0.492]

2 This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. [sent-12, score-0.316]

3 Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. [sent-15, score-0.205]

4 As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. [sent-16, score-0.405]

5 Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. [sent-18, score-0.377]

6 (Section 1, Introduction) We study the learning ability of classifiers trained on examples generated from different sources, but where some observations are partially missing. [sent-19, score-0.118]

7 This problem occurs for example in non-parallel multilingual document collections, where documents may be available in different languages, but each document in a given language may not be translated into all (or any) of the other languages. [sent-20, score-0.376]

8 Our framework assumes the existence of view generating functions which may approximate missing examples using the observed ones. [sent-21, score-0.249]

9 In the case of multilingual corpora these view generating functions may be Machine Translation systems which for each document in one language produce its translations in all other languages. [sent-22, score-0.394]

10 From this result we induce a trade-off between the number of training examples, the number of views and the ability of view generating functions to produce accurate additional views. [sent-27, score-0.538]

11 This trade-off helps us identify situations in which artificially generated views may lead to substantial performance gains. [sent-28, score-0.338]

12 We then show how the agreement of classifiers over their class predictions on unlabeled training data may lead to a much tighter trade-off. [sent-29, score-0.198]

13 Section 4 describes our trade-off bound in the Empirical Risk Minimization (ERM) setting, and shows how and when the additional, artificially generated views may yield a better generalization performance in a supervised setting. [sent-33, score-0.422]

14 Section 5 shows how to exploit these results when additional unlabeled training data are available, in order to obtain a more accurate trade-off. [sent-34, score-0.166]

15 Each example is a multi-view observation x = (x^1, . . . , x^V), where different views x^v provide a representation of the same object in different sets X_v. [sent-41, score-0.686]

16 In the setting of multilingual classification, each view is the textual representation of a document written in a given language (e.g. English or French). [sent-43, score-0.354]

17 We consider binary classification problems where, given a multi-view observation, some of the views are not observed (we obviously require that at least one view is observed). [sent-46, score-0.433]

18 This happens, for instance, when documents may be available in different languages, yet a given document may only be available in a single language. [sent-47, score-0.154]

19 Observations thus belong to X = (X_1 ∪ {⊥}) × · · · × (X_V ∪ {⊥}), where x^v = ⊥ means that the v-th view is not observed. [sent-51, score-0.465]

20 In binary classification, we assume that examples are pairs (x, y), with y ∈ Y def= {0, 1}, drawn according to a fixed but unknown distribution D over X × Y, such that P_{(x,y)~D}(∀v : x^v = ⊥) = 0 (at least one view is available). [sent-52, score-0.586]

21 In multilingual text classification, a parallel corpus is a dataset where all views are always observed (i.e. P_{(x,y)~D}(∃v : x^v = ⊥) = 0). [sent-53, score-0.549]

22 A comparable corpus, in contrast, is a dataset where only one view is available for each example (i.e. P_{(x,y)~D}(|{v : x^v ≠ ⊥}| = 1) = 1). [sent-55, score-0.56]

23 For a given observation x, the views v such that x^v ≠ ⊥ will be called the observed views. [sent-58, score-0.714]

24 The originality of our setting is that we assume view generating functions Ψ_{v→v′} : X_v → X_{v′} which take as input a given view x^v and output an element of X_{v′} that we assume is close to what x^{v′} would be if it were observed. [sent-59, score-1.021]

25 In our multilingual text classification example, the view generating functions are Machine Translation systems. [sent-60, score-0.329]

26 These generating functions can then be used to create surrogate observations, such that all views are available. [sent-61, score-0.38]

27 For a given partially observed x, the completed observation x̃ is obtained as: for all v, x̃^v = x^v if x^v ≠ ⊥, and x̃^v = Ψ_{v′→v}(x^{v′}) otherwise, where v′ is such that x^{v′} ≠ ⊥ (Eq. 1). In this paper, we focus on the case where only one view is observed for each example. [sent-62, score-1.658]
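
To make Eq. 1 concrete, here is a minimal sketch of the view-completion step. It assumes views are stored in a list with None standing in for ⊥, and that psi[v_src][v_dst] holds a generating function Ψ_{v_src→v_dst} (e.g. a Machine Translation system); all names are illustrative, not from the paper's code.

```python
# Minimal sketch of Eq. 1: complete a partially observed multi-view example.
# Assumptions: views are stored in a list, None stands for the unobserved
# marker, and psi[v_src][v_dst] is a view generating function.

def complete_observation(x, psi):
    observed = [v for v, xv in enumerate(x) if xv is not None]
    assert observed, "at least one view must be observed"
    v_src = observed[0]  # the paper focuses on exactly one observed view
    return [xv if xv is not None else psi[v_src][v](x[v_src])
            for v, xv in enumerate(x)]
```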

28 Our study extends to the situation where two or more views may be observed in a straightforward manner. [sent-64, score-0.361]

29 Our setting differs from previous multi-view learning studies [5] mainly on the straightforward generalization to more than two views and the use of view generating functions to induce the missing views from the observed ones. [sent-65, score-0.88]

30 Following the standard multi-view framework, in which all views are observed [3, 13], we assume that we are given V deterministic classifier sets (H_v)_{v=1}^V, each working on one specific view. [sent-68, score-0.359]

31 That is, for each view v, H_v is a set of functions h_v : X_v → {0, 1}. [sent-69, score-0.781]

32 The multi-view classifiers considered are of the form {x ↦ Φ_C(h_1, . . . , h_V, x) | ∀v, h_v ∈ H_v}. For simplicity, in the rest of the paper, when the context is clear, the function x ↦ Φ_C(h_1, . . . , h_V, x) is written Φ_C. [sent-74, score-0.672]

33 The single-view baseline (Eq. 3) predicts from the observed view only: c^{svb}_{h_1,...,h_V}(x) = h_v(x^v), where v is the observed view for x. [sent-91, score-0.792]

34 - Generated Views as Additional Training Data: The most natural way to use the generated views for learning is to use them as additional training material for the view-specific classifiers: ∀v, h_v ∈ arg min_{h ∈ H_v} Σ_{(x,y) ∈ S} e(h, (x̃^v, y)) (Eq. 4), with x̃ defined by Eq. 1. [sent-93, score-1.076]
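
A sketch of the training rule in Eq. 4, reusing complete_observation from the earlier sketch; make_classifier and the data layout are assumptions for illustration (any scikit-learn-style estimator with a fit method would do), not the paper's implementation.

```python
# Sketch of Eq. 4: each view-specific classifier h_v is trained on the v-th
# view of every completed (possibly translated) labeled example.
# make_classifier() is assumed to return an object with fit(X, y) that
# returns the fitted model (e.g. a linear SVM).

def train_view_classifiers(S, psi, V, make_classifier):
    completed = [(complete_observation(x, psi), y) for x, y in S]
    classifiers = []
    for v in range(V):
        Xv = [x[v] for x, _ in completed]  # v-th view of each example
        yv = [y for _, y in completed]
        classifiers.append(make_classifier().fit(Xv, yv))
    return classifiers
```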

35 - Multi-view Gibbs Classifier: In order to avoid the potential bias introduced by the use of generated views only during training, we consider them also during testing. [sent-102, score-0.338]

36 This becomes a standard multi-view setting, where generated views are used exactly as if they were observed. [sent-103, score-0.338]

37 Training is done as in Eq. 4, but the prediction is carried out with respect to the probability distribution of classes, by estimating the probability of membership in class 1 from the mean prediction of the view-specific classifiers: ∀x, c^{mg}_{h_1,...,h_V}(x) = (1/V) Σ_{v=1}^V h_v(x̃^v) (Eq. 5). [sent-105, score-0.173]

38 We assume deterministic view-specific classifiers for simplicity and with no loss of generality. [sent-108, score-0.69]

39 - Multi-view Majority Voting: With view generating functions involved in training and test, a natural way to obtain a (generally) deterministic classifier with improved performance is to take the majority vote associated with the Gibbs classifier. [sent-109, score-0.36]

40 Training is again done as in Eq. 4, but the final prediction is made using a majority vote: [sent-111, score-0.149]

41 ∀x, c^{mv}_{h_1,...,h_V}(x) equals 1 if Σ_{v=1}^V h_v(x̃^v) > V/2, equals 0 if the sum is below V/2, and ties at exactly V/2 are resolved by an indicator term I(·) (Eq. 6). [sent-114, score-1.344]
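
The two prediction rules can be sketched as follows; each h_v is assumed to be a callable mapping its (completed) view to {0, 1}, and breaking exact ties at random is an assumption about the indicator term in Eq. 6.

```python
import random

# Sketches of the multi-view prediction rules. The Gibbs-style score (Eq. 5)
# is the mean of the view-specific 0/1 predictions, read as an estimate of
# P(y = 1 | x); the majority vote (Eq. 6) thresholds it at 1/2.

def gibbs_score(hs, x_completed):
    return sum(h(xv) for h, xv in zip(hs, x_completed)) / len(hs)

def majority_vote(hs, x_completed):
    s = gibbs_score(hs, x_completed)
    if s > 0.5:
        return 1
    if s < 0.5:
        return 0
    return random.randint(0, 1)  # tie: exactly V/2 votes for class 1
```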

42 (Section 4, The trade-offs with the ERM principle) We now analyze how the generated views can improve generalization performance. [sent-117, score-0.361]

43 Essentially, the trade-off is that generated views offer additional training material, therefore potentially helping learning, but can also be of lower quality, which may degrade learning. [sent-118, score-0.404]

44 Theorem 1 Let D be a distribution over X × Y satisfying P_{(x,y)~D}(|{v : x^v ≠ ⊥}| = 1) = 1, i.e. exactly one view is observed per example. [sent-127, score-0.373]

45 For each view v, denote e ∘ H_v def= {(x^v, y) ↦ e(h, (x^v, y)) | h ∈ H_v}, and denote, for any sequence S^v ∈ (X_v × Y)^{m_v} of size m_v, R̂_{m_v}(e ∘ H_v, S^v) the empirical Rademacher complexity of e ∘ H_v on S^v. [sent-133, score-0.231]

46 Then, with probability at least 1 − δ: ε(c^{mg}_{h_1,...,h_V}) ≤ inf_{h′_1,...,h′_V} ε(c^{mg}_{h′_1,...,h′_V}) + 2 Σ_{v=1}^V (m_v/m) R̂_{m_v}(e ∘ H_v, S^v) + 6 √(ln(2/δ)/(2m)). [sent-140, score-0.123]

47 Here, for all v, S^v def= {(x_i^v, y_i) | i = 1, . . . , m and x_i^v ≠ ⊥}, m_v = |S^v|, and h_v ∈ H_v is the classifier minimizing the empirical risk on S^v. [sent-142, score-1.142]
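
Rendered as a display, the bound of Theorem 1 reads as follows; the constants and the form of the complexity term are read off the extracted fragments above, so this is a best-effort reconstruction rather than a verbatim quote of the paper.

```latex
\epsilon\big(c^{mg}_{h_1,\dots,h_V}\big)
  \;\le\; \inf_{h'_1,\dots,h'_V} \epsilon\big(c^{mg}_{h'_1,\dots,h'_V}\big)
  \;+\; 2\sum_{v=1}^{V} \frac{m_v}{m}\,\hat{R}_{m_v}\!\left(e \circ H_v,\, S^v\right)
  \;+\; 6\sqrt{\frac{\ln(2/\delta)}{2m}}
```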

48 As before, S^v is built from the training indices i = 1, . . . , m, h_v ∈ H_v is the classifier minimizing the empirical risk on S^v, and η is the difference between the best risk of the multi-view Gibbs classifier achievable with generated views, inf_{h′_v ∈ H_v} ε(c^{mg}_{h′_1,...,h′_V}), and the best risk achievable with truly observed views. [sent-151, score-1.397]

49 Therefore η measures the loss incurred by using the view generating functions. [sent-171, score-0.142]

50 Majority voting: One advantage of the multi-view setting at prediction time is that we can use a majority voting scheme, as described in Section 2. [sent-178, score-0.286]

51 (Section 5, Agreement-Based Semi-Supervised Learning) One advantage of the multi-view settings described in the previous section is that unlabeled training examples may naturally be taken into account in a semi-supervised learning scheme, using existing approaches for multi-view learning (e.g. co-training [3]). [sent-195, score-0.182]

52 In this section, we describe how, under the framework of [11], the supervised learning trade-off presented above can be improved using extra unlabeled examples. [sent-198, score-0.161]

53 This framework is based on the notion of disagreement between the various view-specific classifiers, defined as the expected variance of their outputs (Eq. 8): [sent-199, score-0.118]

54 V(h_1, . . . , h_V) def= E_{(x,y)~D}[ (1/V) Σ_v h_v(x̃^v)² − ( (1/V) Σ_v h_v(x̃^v) )² ]. The overall idea is that a set of good view-specific classifiers should agree on their predictions, making the expected variance small. [sent-202, score-1.373]
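
The disagreement of Eq. 8 can be estimated on an unlabeled sample as sketched below; since the outputs are 0/1, h_v(x̃^v)² = h_v(x̃^v) and the per-example variance reduces to mean − mean². Names and data layout are assumptions.

```python
# Empirical estimate of the disagreement V(h_1, ..., h_V) of Eq. 8: the
# expected variance of the 0/1 outputs of the view-specific classifiers,
# averaged over an unlabeled sample U of completed examples.

def disagreement(hs, U_completed):
    total = 0.0
    for x in U_completed:
        preds = [h(xv) for h, xv in zip(hs, x)]
        mean = sum(preds) / len(hs)
        total += mean - mean * mean  # variance of 0/1 values: mean - mean^2
    return total / len(U_completed)
```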

55 First, it does not depend on the true class labels, making its estimation easy over a large, unlabeled training set. [sent-204, score-0.159]

56 This suggests a simple way to do semi-supervised learning: the unlabeled data can be used to choose, among the classifiers minimizing the empirical risk on the labeled training set, those with best generalization performance (by choosing the classifiers with highest agreement on the unlabeled set). [sent-209, score-0.417]

57 This is particularly interesting when the number of labeled examples is small, as the train error is usually close to 0. [sent-210, score-0.124]
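
The selection principle of the preceding sentences amounts to a one-liner on top of the disagreement estimate sketched above (again an illustrative assumption, not the paper's code):

```python
# Among candidate tuples of empirical risk minimizers, keep the tuple with
# the smallest disagreement (Eq. 8) on the unlabeled sample.

def select_by_agreement(candidate_tuples, U_completed):
    return min(candidate_tuples, key=lambda hs: disagreement(hs, U_completed))
```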

58 Theorem 3 of [11] provides a theoretical value B(ǫ, δ) for the minimum number of unlabeled examples required to estimate Eq. 8 to precision ǫ with probability at least 1 − δ. [sent-211, score-0.142]

59 The following result gives a tighter bound of the generalization error of the multi-view Gibbs classifier when unlabeled data are available. [sent-215, score-0.123]

60 Under the conditions and notations of Theorem 1, assume furthermore that we have access to u ≥ B(µ/2, δ/2) unlabeled examples drawn i.i.d. according to the marginal of D on X. [sent-218, score-0.142]

61 Then, with probability at least 1 − δ, if the empirical risk minimizers h_v ∈ arg min_{h ∈ H_v} Σ_{(x^v, y) ∈ S^v} e(h, (x^v, y)) have a disagreement less than µ/2 on the unlabeled set, then ε(c^{mg}_{h_1,...,h_V}) ≤ inf_{h′_1,...,h′_V} ε(c^{mg}_{h′_1,...,h′_V}) plus complexity terms restricted to the subset of classifiers that agree on the unlabeled set. [sent-222, score-0.864]

62 Also note that the more views we have, the greater the reduction in classifier set complexity should be. [sent-231, score-0.313]

63 Notice that this semi-supervised learning principle enforces agreement between the view-specific classifiers. [sent-232, score-0.131]

64 In the extreme case where they almost always give the same output, majority voting is then nearly equivalent to the Gibbs classifier (when all voters agree, any vote is equal to the majority vote). [sent-233, score-0.31]

65 We therefore expect the majority vote and the Gibbs classifier to yield similar performance in the semi-supervised setting. [sent-234, score-0.143]

66 This resulted in 12-30K documents per language, and 11-34K documents per class (see Table 1). [sent-241, score-0.149]

67 In addition, we reserved a test split containing 20% of the documents (respecting class and language proportions) for testing. [sent-242, score-0.131]

68 The artificial views were produced using a statistical Machine Translation system. (Table 1: Distribution of documents over languages and classes in the comparable corpus.) [sent-248, score-0.485]

69 Each document from the comparable corpus was thus translated to the other 4 languages. [sent-262, score-0.14]

70 We first present experimental results obtained in supervised learning, using various amounts of labeled examples. [sent-264, score-0.123]

71 For comparison, we employed the four learning strategies described in Section 3: (1) the single-view baseline svb (Eq. 3); [sent-266, score-0.207]

72 (2) generated views as additional training data, gvb (Eq. 4); (3) the multi-view Gibbs classifier mvg (Eq. 5); and (4) multi-view majority voting mvm (Eq. 6). [sent-267, score-0.476]

73 Recall that the second setting, gvb , is the most straightforward way to train and test classifiers when additional examples are available (or generated) from different sources. [sent-271, score-0.183]

74 It can thus be seen as a baseline approach, as opposed to the last two strategies (mvg and mvm ), where view-specific classifiers are both trained and tested over both original and translated documents. [sent-272, score-0.294]

75 Note also that in our case (V = 5 views), additional training examples obtained from machine translation represent 4 times as many labeled examples as the original texts used to train the baseline svb . [sent-273, score-0.482]

76 Table 2: Test classification accuracy and F1 in the supervised setting, for both baselines (svb, gvb), Gibbs (mvg) and majority voting (mvm) strategies, averaged over 10 random sets of 10 labeled examples per view. [sent-275, score-0.404]

77 Results obtained in a supervised setting with only 10 labeled documents per language for training are summarized in Table 2. [sent-332, score-0.458]

78 All learning strategies using the generated views during training outperform the single-view baseline. [sent-333, score-0.408]

79 This shows that, although imperfect, artificial views do bring additional information that compensates for the lack of labeled data. [sent-334, score-0.401]

80 Although the multi-view Gibbs classifier predicts based on a translation rather than the original in 80% of cases, it produces almost identical performance to the gvb run (which only predicts using the original text). [sent-335, score-0.127]

81 Multi-view majority voting reaches the best performance, yielding a 6-17% improvement in accuracy over the baseline. [sent-337, score-0.167]

82 These figures show that when there are enough labeled examples (around 500 for these 3 classes), the artificial views do not provide any additional useful information over the original-language examples. [sent-340, score-0.443]

83 When there are sufficient original labeled examples, additional generated views do not provide more useful information for learning than what view-specific classifiers have available already. [sent-342, score-0.449]

84 We now investigate the use of unlabeled training examples for learning the view-specific classifiers. [sent-343, score-0.182]

85 Recall that in the case where view-specific classifiers are in agreement over the class labels of a large number of unlabeled examples, the multi-view Gibbs and majority vote strategies should have the same performance. [sent-345, score-0.353]

86 In order to enforce agreement between classifiers on the unlabeled set, we use a variant of the iterative co-training algorithm [3]. [sent-346, score-0.139]

87 Given the view-specific classifiers trained on an initial set of labeled examples, we iteratively assign pseudo-labels to the unlabeled examples for which all classifier predictions agree. [sent-347, score-0.237]

88 Key differences between this algorithm and co-training are the number of views used for learning (5 instead of 2), and the use of unanimous and simultaneous labeling. [sent-349, score-0.313]
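
A minimal sketch of this self-training loop under stated assumptions: train fits the V view-specific classifiers on labeled completed examples and returns them as callables, iteration stops when no unanimously labeled example remains, and the cap on iterations is an added safeguard, not from the paper.

```python
# Self-learning variant of co-training: unlabeled examples on which all V
# view-specific classifiers agree receive that prediction as pseudo-label,
# move to the labeled set, and the classifiers are retrained.

def multiview_self_train(S, U, train, max_iter=10):
    """S: list of (completed multi-view example, label); U: list of
    completed unlabeled examples; train(S) -> list of V callables h_v."""
    S, U = list(S), list(U)
    hs = train(S)
    for _ in range(max_iter):
        keep, unanimous = [], []
        for x in U:
            preds = {h(xv) for h, xv in zip(hs, x)}
            if len(preds) == 1:
                unanimous.append((x, preds.pop()))  # agreed pseudo-label
            else:
                keep.append(x)
        if not unanimous:
            break
        S += unanimous
        U = keep
        hs = train(S)
    return hs
```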

89 (Figure: performance as a function of the size of the labeled training set for classes C15, ECAT and M11.) [sent-385, score-0.124]

90 Prediction from the multi-view SVM models obtained from this self-learning multiple-view algorithm is done using either Gibbs (mvg^s) or majority voting (mvm^s). [sent-387, score-0.149]

91 For comparison we also trained a TSVM model [7] on each view separately, a semi-supervised equivalent to the single-view baseline strategy. [sent-389, score-0.172]

92 Note that the TSVM model mostly outperforms the supervised baseline svb, although the F1 suffers on some classes. [sent-390, score-0.238]

93 Table 3: Test classification accuracy and F1 in the semi-supervised setting, for single-view TSVM and multi-view self-learning using either Gibbs (mvg^s) or majority voting (mvm^s), averaged over 10 random sets using 10 labeled examples per view to start. [sent-392, score-0.363]

94 For comparison we provide the single-view baseline and multi-view majority voting performance for supervised learning. [sent-393, score-0.275]

95 As expected, the performances of the mvg and mvm strategies are similar. [sent-461, score-0.348]

96 First, we proposed a bound on the risk of the Gibbs classifier trained over artificially completed multi-view observations, which directly corresponds to our target application of learning text classifiers from a comparable corpus. [sent-463, score-0.144]

97 We showed that our bound may lead to a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. [sent-464, score-0.205]

98 Our result identifies in which case it is advantageous to learn with additional artificial views, as opposed to sticking with the baseline setting in which a classifier is trained over single-view observations. [sent-465, score-0.222]

99 We showed that in the case where view-specific classifiers agree over the class labels of additional unlabeled training data, the previous trade-off becomes much tighter. [sent-467, score-0.214]

100 Empirical results on a comparable multilingual corpus support our findings by showing that additional views obtained using a Machine Translation system may significantly increase classification performance in the most interesting situation, when there are few labeled data available for training. [sent-468, score-0.642]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hv', 0.672), ('xv', 0.373), ('views', 0.313), ('mvg', 0.159), ('mvm', 0.159), ('ers', 0.135), ('multilingual', 0.13), ('svb', 0.13), ('classi', 0.121), ('cmg', 0.101), ('ecat', 0.101), ('unlabeled', 0.1), ('view', 0.092), ('majority', 0.089), ('def', 0.079), ('voting', 0.078), ('gibbs', 0.078), ('tsvm', 0.076), ('cb', 0.075), ('gvb', 0.072), ('er', 0.066), ('documents', 0.065), ('labeled', 0.062), ('supervised', 0.061), ('ccat', 0.058), ('gcat', 0.058), ('translation', 0.055), ('vote', 0.054), ('languages', 0.051), ('generating', 0.05), ('baseline', 0.047), ('language', 0.047), ('mv', 0.044), ('reuters', 0.043), ('cmv', 0.043), ('rmv', 0.043), ('document', 0.043), ('rademacher', 0.042), ('examples', 0.042), ('text', 0.04), ('training', 0.04), ('inf', 0.039), ('disagreement', 0.039), ('agreement', 0.039), ('corpus', 0.038), ('collections', 0.038), ('risk', 0.037), ('cially', 0.036), ('comparable', 0.034), ('trained', 0.033), ('french', 0.033), ('english', 0.03), ('strategies', 0.03), ('agree', 0.029), ('nrc', 0.029), ('portage', 0.029), ('observed', 0.028), ('mt', 0.028), ('additional', 0.026), ('german', 0.025), ('generated', 0.025), ('translated', 0.025), ('du', 0.024), ('setting', 0.024), ('quality', 0.023), ('hardoon', 0.023), ('italian', 0.023), ('available', 0.023), ('generalization', 0.023), ('achievable', 0.023), ('classes', 0.022), ('multiview', 0.022), ('erm', 0.022), ('arti', 0.021), ('situation', 0.02), ('train', 0.02), ('unbalanced', 0.02), ('missing', 0.02), ('class', 0.019), ('interactive', 0.019), ('ndings', 0.019), ('cation', 0.018), ('deterministic', 0.018), ('crammer', 0.018), ('texts', 0.018), ('partially', 0.018), ('rm', 0.018), ('textual', 0.018), ('preferable', 0.018), ('prediction', 0.017), ('address', 0.017), ('functions', 0.017), ('complexities', 0.016), ('council', 0.016), ('tokens', 0.016), ('ln', 0.016), ('technologies', 0.016), ('system', 0.016), ('empirical', 0.016), ('corpora', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1

2 0.10546049 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

3 0.083421402 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

Author: Lei Shi, Thomas L. Griffiths

Abstract: The goal of perception is to infer the hidden states in the hierarchical process by which sensory data are generated. Human behavior is consistent with the optimal statistical solution to this problem in many tasks, including cue combination and orientation detection. Understanding the neural mechanisms underlying this behavior is of particular importance, since probabilistic computations are notoriously challenging. Here we propose a simple mechanism for Bayesian inference which involves averaging over a few feature detection neurons which fire at a rate determined by their similarity to a sensory stimulus. This mechanism is based on a Monte Carlo method known as importance sampling, commonly used in computer science and statistics. Moreover, a simple extension to recursive importance sampling can be used to perform hierarchical Bayesian inference. We identify a scheme for implementing importance sampling with spiking neurons, and show that this scheme can account for human behavior in cue combination and the oblique effect. 1

4 0.079509713 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang

Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1

5 0.063046344 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han

Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1

6 0.062784061 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

7 0.060700025 190 nips-2009-Polynomial Semantic Indexing

8 0.058461599 71 nips-2009-Distribution-Calibrated Hierarchical Classification

9 0.057183322 112 nips-2009-Human Rademacher Complexity

10 0.055132098 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

11 0.05373919 260 nips-2009-Zero-shot Learning with Semantic Output Codes

12 0.053404462 47 nips-2009-Boosting with Spatial Regularization

13 0.053264853 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

14 0.052982278 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

15 0.052085813 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

16 0.044076968 122 nips-2009-Label Selection on Graphs

17 0.043811556 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

18 0.043180898 55 nips-2009-Compressed Least-Squares Regression

19 0.042983189 90 nips-2009-Factor Modeling for Advertisement Targeting

20 0.042652071 141 nips-2009-Local Rules for Global MAP: When Do They Work ?


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.125), (1, 0.013), (2, -0.053), (3, 0.017), (4, -0.012), (5, -0.032), (6, -0.092), (7, -0.002), (8, -0.075), (9, 0.101), (10, -0.0), (11, 0.034), (12, -0.059), (13, -0.115), (14, 0.071), (15, 0.009), (16, -0.006), (17, 0.031), (18, 0.072), (19, -0.039), (20, 0.065), (21, -0.082), (22, 0.026), (23, -0.042), (24, 0.04), (25, -0.038), (26, -0.036), (27, -0.025), (28, 0.077), (29, -0.005), (30, -0.059), (31, -0.078), (32, 0.063), (33, -0.023), (34, -0.026), (35, -0.053), (36, 0.113), (37, -0.015), (38, -0.056), (39, -0.059), (40, -0.06), (41, -0.011), (42, -0.086), (43, 0.033), (44, 0.012), (45, -0.021), (46, -0.052), (47, 0.047), (48, 0.015), (49, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92364687 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1

2 0.63827443 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

3 0.62115943 71 nips-2009-Distribution-Calibrated Hierarchical Classification

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

4 0.52778035 193 nips-2009-Potential-Based Agnostic Boosting

Author: Varun Kanade, Adam Kalai

Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1

5 0.51743591 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.

6 0.51347071 260 nips-2009-Zero-shot Learning with Semantic Output Codes

7 0.51322174 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

8 0.50752133 112 nips-2009-Human Rademacher Complexity

9 0.48590809 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

10 0.46395451 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

11 0.44869548 47 nips-2009-Boosting with Spatial Regularization

12 0.44704488 72 nips-2009-Distribution Matching for Transduction

13 0.4272323 94 nips-2009-Fast Learning from Non-i.i.d. Observations

14 0.42513195 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

15 0.41487753 149 nips-2009-Maximin affinity learning of image segmentation

16 0.41295108 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

17 0.41061103 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

18 0.38849238 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

19 0.38520798 237 nips-2009-Subject independent EEG-based BCI decoding

20 0.3836987 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(22, 0.01), (24, 0.055), (25, 0.052), (35, 0.033), (36, 0.086), (39, 0.038), (58, 0.044), (61, 0.015), (71, 0.445), (86, 0.086), (91, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9824813 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

Author: Laura Dietz, Valentin Dallmeier, Andreas Zeller, Tobias Scheffer

Abstract: We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino. 1

2 0.98135394 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei

Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject first studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle filter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability. 1

3 0.97901827 53 nips-2009-Complexity of Decentralized Control: Special Cases

Author: Martin Allen, Shlomo Zilberstein

Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1

4 0.95415032 11 nips-2009-A General Projection Property for Distribution Families

Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári

Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1

same-paper 5 0.89056551 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1

6 0.88647532 56 nips-2009-Conditional Neural Fields

7 0.80968744 205 nips-2009-Rethinking LDA: Why Priors Matter

8 0.79831946 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

9 0.72641945 204 nips-2009-Replicated Softmax: an Undirected Topic Model

10 0.71280593 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

11 0.69766486 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

12 0.69472671 206 nips-2009-Riffled Independence for Ranked Data

13 0.69374907 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

14 0.68917495 96 nips-2009-Filtering Abstract Senses From Image Search Results

15 0.67514199 226 nips-2009-Spatial Normalized Gamma Processes

16 0.67438012 260 nips-2009-Zero-shot Learning with Semantic Output Codes

17 0.66422057 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory

18 0.65930754 154 nips-2009-Modeling the spacing effect in sequential category learning

19 0.64826292 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

20 0.64728373 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference