
88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition


Source: pdf

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. [sent-2, score-0.294]

2 In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. [sent-4, score-0.612]

3 The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. [sent-5, score-0.33]

4 It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. [sent-6, score-0.073]

5 Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. [sent-7, score-0.309]

6 The ability to select the most informative features as needed can reduce the training time and the risk of over-fitting of parameters. [sent-10, score-0.078]

7 Furthermore, in complex modeling tasks, obtaining the large amount of labeled data necessary for training can be impractical. [sent-11, score-0.11]

8 On the other hand, large unlabeled datasets are often easy to obtain, making semi-supervised learning methods appealing in various real-world applications. [sent-12, score-0.141]

9 The goal of our work is to build an activity recognition system that is not only accurate but also scalable, efficient, and easy to train and deploy. [sent-13, score-0.085]

10 An important application domain for activity recognition technologies is in health-care, especially in supporting elder care, managing cognitive disabilities, and monitoring long-term health. [sent-14, score-0.085]

11 Some of the key challenges faced by current activity inference systems are the amount of human effort spent in labeling and feature engineering and the computational complexity and cost associated with training. [sent-16, score-0.176]

12 In this paper, we introduce a fast and scalable semi-supervised training algorithm for CRFs and evaluate its classification performance on extensive real world activity traces gathered using wearable sensors. [sent-18, score-0.188]

13 Several supervised techniques have been proposed for feature selection in CRFs. [sent-20, score-0.095]

14 For discrete features, McCallum [2] suggested an efficient method for feature induction by iteratively increasing conditional log-likelihood. [sent-21, score-0.089]

15 Dietterich [3] applied gradient tree boosting to select features in CRFs by combining boosting with parameter estimation for 1D linear-chain models. [sent-22, score-0.199]

16 Boosted random fields (BRFs) [4] combine boosting and belief propagation for feature selection and parameter estimation for densely connected graphs that have weak pairwise connections. [sent-23, score-0.212]

17 Liao et al. [5] developed a more general version of BRFs, called virtual evidence boosting (VEB), that does not make any assumptions about graph connectivity or the strength of pairwise connections. [sent-26, score-0.249]

18 The objective function in VEB is a soft version of maximum pseudo-likelihood (MPL), where the goal is to maximize the sum of local log-likelihoods given soft evidence from each node's neighbors. [sent-27, score-0.129]
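Concretely, the VEB objective can be written as the following soft pseudo-likelihood; this restates the formula that appears in the Appendix excerpt near the end of this summary, with $\mathrm{ve}_i$ denoting the virtual evidence as in eqs. (3)–(8) below:

```latex
% VEB soft pseudo-likelihood: each local term is conditioned on the
% virtual evidence ve_i (BP messages from n(y_i) plus observations)
% rather than on the neighbors' true labels as in MPL.
L_{\mathrm{VEB}} \;=\; \sum_{i=1}^{N} \log p\big(y_i \mid \mathrm{ve}_i\big)
```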

19 This objective function is similar to that used in boosting, which makes it suitable for unified feature selection and parameter estimation. [sent-28, score-0.089]

20 This approximation applies to any CRF structure and leads to a significant reduction in training complexity and time. [sent-29, score-0.058]

21 However, it is not straightforward to incorporate unlabeled data in discriminative models using the traditional conditional likelihood criterion. [sent-31, score-0.165]

22 More recently, Grandvalet and Bengio [9] proposed a minimum entropy regularization framework for incorporating unlabeled data. [sent-33, score-0.173]

23 Jiao et al. [10] used this framework and proposed an objective function that combines the conditional likelihood of the labeled data with the conditional entropy of the unlabeled data to train 1D CRFs; this was extended to 2D lattice structures by Lee et al. [sent-36, score-0.347]

24 In our work, we combine the minimum entropy regularization framework for incorporating unlabeled data with VEB for training CRFs. [sent-39, score-0.218]

25 An alternative to approximating the conditional likelihood is to change the objective function. [sent-45, score-0.073]

26 For MPL the CRF is cut into a set of independent patches; each patch consists of a hidden node or class label $y_i$, the true values of its direct neighbors, and the observations. [sent-47, score-0.175]

27 When a prior is used in the maximum likelihood objective function as a regularizer – the second term in eq. [sent-51, score-0.056]

28 By extending the standard LogitBoost algorithm [16], VEB integrates boosting-based feature selection into CRF training. [sent-54, score-0.288]

29 The objective function used in VEB is very similar to MPL, except that VEB uses the messages from the neighboring nodes as virtual evidence instead of using the true labels of neighbors. [sent-55, score-0.237]

30 The use of virtual evidence helps to reduce over-estimation of neighborhood dependencies. [sent-56, score-0.166]

31 The weak learner and the LogitBoost quantities are given by eqs. (3)–(4):
$$f_t(\mathrm{ve}_i) = \arg\min_f \sum_i w_i\, \mathbb{E}\big[(f(\mathrm{ve}_i) - z_i)^2\big] \quad (3)$$
$$w_i = p(y_i \mid \mathrm{ve}_i)\,\big(1 - p(y_i \mid \mathrm{ve}_i)\big), \qquad z_i = \frac{y_i - 0.5}{p(y_i \mid \mathrm{ve}_i)} \quad (4)$$
The $w_i$ and $z_i$ in equation (4) are the boosting weight and working response, respectively, for the $i$th data point, exactly as in LogitBoost. [sent-61, score-0.69]
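As a hedged illustration of eqs. (3)–(4), here is a minimal numpy sketch of one LogitBoost-style step for the binary labeled case: computing $w_i$ and $z_i$ and fitting a weighted least-squares decision stump. The stump family and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def boosting_weights(p):
    # Eq. (4): w_i = p(y_i|ve_i) * (1 - p(y_i|ve_i))
    return p * (1.0 - p)

def working_responses(y, p):
    # Eq. (4): z_i = (y_i - 0.5) / p(y_i|ve_i), with y_i in {0, 1}
    return (y - 0.5) / p

def fit_stump(x, z, w):
    # Weighted least-squares stump over a scalar feature x: predicts a
    # constant a where x <= t and b where x > t, minimizing
    # sum_i w_i (f(x_i) - z_i)^2, i.e. the WLSE problem of eq. (3)
    # restricted to this weak-learner family.
    best = None
    for t in np.unique(x):
        left = x <= t
        a = np.average(z[left], weights=w[left])
        b = np.average(z[~left], weights=w[~left]) if (~left).any() else 0.0
        err = np.sum(w * (np.where(left, a, b) - z) ** 2)
        if best is None or err < best[0]:
            best = (err, t, a, b)
    return best  # (weighted error, threshold, left value, right value)

# Toy usage with made-up numbers.
p = np.array([0.7, 0.4, 0.9, 0.2])    # current p(y_i | ve_i)
y = np.array([1, 0, 1, 0])            # labels
x = np.array([0.3, 0.1, 0.8, 0.05])   # one scalar feature per instance
print(fit_stump(x, working_responses(y, p), boosting_weights(p)))
```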

32 The WLSE problem (eq. 3) involves $N \times X$ points because of virtual evidence, as opposed to $N$ points in LogitBoost. [sent-63, score-0.166]

33 Although the derivation is for the binary case ($y_i \in \{0, 1\}$), it is easily extendible to the multi-class case, and we have done that in our experiments. [sent-67, score-0.175]

34 At each iteration, $\mathrm{ve}_i$ is updated as the messages from $n(y_i)$ change with the addition of new features. [sent-68, score-0.703]

35 We run belief propagation (BP) to obtain the virtual evidence before each iteration. [sent-69, score-0.199]
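For intuition, a minimal sum-product sketch for a linear-chain model: the per-node beliefs returned below play the role of $p(y_i \mid \mathrm{ve}_i)$. The chain structure and the potentials are illustrative assumptions; the paper's CRFs need not be chains.

```python
import numpy as np

def chain_beliefs(local, pairwise):
    # local: (T, S) unnormalized node potentials phi_i(y_i);
    # pairwise: (S, S) potential psi(y_i, y_{i+1}).
    # Returns (T, S) per-node beliefs: the normalized product of phi_i
    # and the forward/backward messages arriving at node i, which plays
    # the role of p(y_i | ve_i) in the text.
    T, S = local.shape
    fwd = np.ones((T, S))   # fwd[i]: message into node i from the left
    bwd = np.ones((T, S))   # bwd[i]: message into node i from the right
    for i in range(1, T):
        m = (fwd[i - 1] * local[i - 1]) @ pairwise
        fwd[i] = m / m.sum()
    for i in range(T - 2, -1, -1):
        m = pairwise @ (bwd[i + 1] * local[i + 1])
        bwd[i] = m / m.sum()
    belief = local * fwd * bwd
    return belief / belief.sum(axis=1, keepdims=True)

# Toy usage: 4 nodes, 2 states, mild preference for staying in-state.
local = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
pairwise = np.array([[2.0, 1.0], [1.0, 2.0]])
print(chain_beliefs(local, pairwise))
```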

36 The CRF feature weights $\theta$ are computed by solving the WLSE problem, where for local features $n_{ki}$ is the count of feature $k$ in data instance $i$, and for compatibility features $n_{ki}$ is the virtual evidence from the neighbors. [sent-70, score-0.364]

37 For semi-supervised training of CRFs, Jiao et al. [sent-73, score-0.09]

38 [10] have proposed an algorithm that utilizes unlabeled data via entropy regularization – an extension of the approach proposed by [9] to structured CRF models. [sent-75, score-0.159]

39 In this work, we develop semi-supervised virtual evidence boosting (sVEB) that combines feature selection with semi-supervised training of CRFs. [sent-79, score-0.626]

40 sVEB extends the VEB framework to take advantage of unlabeled data via minimum entropy regularization similar to [9, 10, 11]. [sent-80, score-0.159]

41 Here α is a tuning parameter controlling how much influence the unlabeled data will have. [sent-82, score-0.132]
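Written out explicitly (this matches the formula reconstructed in the Appendix excerpt below), the sVEB objective adds the α-weighted negative empirical entropy of the unlabeled part to the VEB soft pseudo-likelihood:

```latex
% Instances 1..N are labeled, N+1..M unlabeled.
L_{\mathrm{sVEB}} \;=\; L_{\mathrm{VEB}} - \alpha H_{\mathrm{emp}}
  \;=\; \sum_{i=1}^{N} \log p\big(y_i \mid \mathrm{ve}_i\big)
  \;+\; \alpha \sum_{i=N+1}^{M} \sum_{y_i} p\big(y_i \mid \mathrm{ve}_i\big)\,
        \log p\big(y_i \mid \mathrm{ve}_i\big)
```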

42 By considering the soft pseudo-likelihood in $L_{\mathrm{sVEB}}$ and using BP to estimate $p(y_i \mid \mathrm{ve}_i)$, sVEB can use boosting to learn the parameters of CRFs. [sent-83, score-0.111]

43 The virtual evidence from the neighboring nodes captures the label dependencies. [sent-84, score-0.198]

44 In other words, sVEB solves the following weighted least-squares error (WLSE) problem to learn the $f_t$'s:
$$f_t = \arg\min_f \Big[ \sum_{i=1}^{N} \sum_{\mathrm{ve}_i} w_i\, p(y_i \mid \mathrm{ve}_i)\,\big(f(x_i) - z_i\big)^2 \;+\; \sum_{i=N+1}^{M} \sum_{y_i} \sum_{\mathrm{ve}_i} w_i\, p(y_i \mid \mathrm{ve}_i)\,\big(f(x_i) - z_i\big)^2 \Big] \quad (7)$$ [sent-89, score-2.122]

45 For labeled data (the first term in eq. 7), the boosting weights $w_i$ and working responses $z_i$ are computed as described in equation (4). [sent-90, score-0.265]

46 But for unlabeled data the expressions for $w_i$ and $z_i$ become more complicated because of the entropy term. [sent-91, score-0.327]

47 We present the equations for $w_i$ and $z_i$ below; please refer to the Appendix for the derivations:
$$w_i = \alpha^2\,\big(1 - p(y_i \mid \mathrm{ve}_i)\big)\Big[p(y_i \mid \mathrm{ve}_i)\big(1 - p(y_i \mid \mathrm{ve}_i)\big) + \log p(y_i \mid \mathrm{ve}_i)\Big]$$ [sent-92, score-0.301]

48 $$z_i = \frac{(y_i - 0.5)\,p(y_i \mid \mathrm{ve}_i)\,\big(1 - \log p(y_i \mid \mathrm{ve}_i)\big)}{\alpha\Big[p(y_i \mid \mathrm{ve}_i)\big(1 - p(y_i \mid \mathrm{ve}_i)\big) + \log p(y_i \mid \mathrm{ve}_i)\Big]} \quad (8)$$
The soft evidence corresponding to messages from the neighboring nodes is obtained by running BP on the entire training dataset (labeled and unlabeled). [sent-93, score-0.25]
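Transcribed directly into code, the reconstructed eq. (8) reads as follows; since the source formula is recovered from garbled text, treat the exact grouping of terms as provisional.

```python
import numpy as np

def unlabeled_w_z(p, y, alpha):
    # Eq. (8) as reconstructed above, for one unlabeled instance;
    # p = p(y_i | ve_i), y in {0, 1}, alpha = entropy tuning parameter.
    common = p * (1.0 - p) + np.log(p)               # shared bracketed term
    w = alpha ** 2 * (1.0 - p) * common              # boosting weight
    z = (y - 0.5) * p * (1.0 - np.log(p)) / (alpha * common)  # working response
    return w, z

# Toy usage with made-up values.
print(unlabeled_w_z(p=0.8, y=1, alpha=0.5))
```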

49 The CRF feature weights $\theta_k$ are computed by solving the WLSE problem (e.g., eq. (7)): [sent-94, score-0.125]

50 $$\theta_k = \sum_{i=1}^{M} \sum_{y_i} w_i\, z_i\, n_{ki} \;\Big/\; \sum_{i=1}^{M} \sum_{y_i} w_i\, n_{ki}$$
Algorithm 1 gives the pseudo-code for sVEB. [sent-96, score-0.736]

51 The main difference between VEB and sVEB is steps 7–10, where we compute the $w_i$'s and $z_i$'s for all possible values of $y_i$ based on the virtual evidence and observations of unlabeled training cases. [sent-97, score-0.689]

52 The boosting weights and working responses are computed using equation (8). [sent-98, score-0.111]

53 4 Experiments We conduct two sets of experiments to evaluate the performance of the sVEB method for training CRFs and the advantage of performing feature selection as part of semi-supervised training. [sent-104, score-0.114]

54 In the first set of experiments, we analyze how much the complexity of the underlying CRF and the tuning parameter α affect the performance, using synthetic data. [sent-105, score-0.05]

55 In the second set of experiments, we evaluate the benefit of feature selection and using unlabeled data on two real-world activity datasets. [sent-106, score-0.249]

56 We compare the performance of the semi-supervised virtual evidence boosting (sVEB) presented in this paper to the semi-supervised maximum likelihood (sML) method [10]. [sent-107, score-0.183]

57 In addition, for the activity datasets, we also evaluate an alternative approach (sML+Boost), where a subset of features is selected in advance using boosting. [sent-108, score-0.101]

58 [Fragment of Algorithm 1:] for $i = N+1, \ldots, M$ and $y_i = 0, 1$ do: compute the likelihood $p(y_i \mid \mathrm{ve}_i)$; compute $w_i$ and $z_i$ using equation (8); end; obtain the “best” weak learner $f_t$ according to equation (7) and update $F_t = F_{t-1} + f_t$; end. [sent-112, score-0.615]
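Putting the pieces together, the fragment above corresponds to a training loop of roughly the following shape. This is a schematic sketch, not the paper's code: `run_bp` and `fit_weak_learner` are stand-ins, only the $y_i = 1$ branch of the unlabeled update is shown, and the multi-class and per-feature bookkeeping are omitted.

```python
import numpy as np

def train_sveb(x, y_labeled, N, M, T, alpha, run_bp, fit_weak_learner):
    # Schematic sVEB loop. Instances 1..N are labeled, N+1..M unlabeled.
    # run_bp(ensemble, x) must return p(y_i = 1 | ve_i) for all M
    # instances under the current ensemble; fit_weak_learner solves the
    # WLSE problem of eq. (7) and returns a callable f_t.
    ensemble = []
    for _ in range(T):
        p1 = run_bp(ensemble, x)                 # virtual evidence via BP
        w, z = np.empty(M), np.empty(M)
        # Labeled cases, eq. (4): condition on the true label.
        p_true = np.where(y_labeled == 1, p1[:N], 1.0 - p1[:N])
        w[:N] = p_true * (1.0 - p_true)
        z[:N] = (y_labeled - 0.5) / p_true
        # Unlabeled cases, eq. (8); only the y_i = 1 branch is shown,
        # Algorithm 1 (steps 7-10) enumerates both label values.
        pu = p1[N:]
        common = pu * (1.0 - pu) + np.log(pu)
        w[N:] = alpha ** 2 * (1.0 - pu) * common
        z[N:] = 0.5 * pu * (1.0 - np.log(pu)) / (alpha * common)
        f_t = fit_weak_learner(x, z, w)          # eq. (7)
        ensemble.append(f_t)                     # F_t = F_{t-1} + f_t
    return ensemble
```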

59 We compare against the supervised maximum likelihood method using all observed features (ML), ML+Boost using a subset of features selected in advance, and virtual evidence boosting (VEB). [sent-131, score-0.282]

60 We randomly choose 50% of them as labeled and the other 50% as unlabeled training data. [sent-138, score-0.222]

61 Figure (1a) shows the average accuracy for the two semi-supervised training methods and their confidence intervals. [sent-143, score-0.073]

62 Given the same amount of training data, sVEB is less likely to overfit because of the feature selection step. [sent-149, score-0.114]

63 Table 1: Accuracy ± 95% confidence interval of the supervised algorithms on activity datasets 1 and 2. [sent-191, score-0.123]

64 We collected two activity datasets using wearable sensors, which include audio, acceleration, light, temperature, pressure, and humidity. [sent-192, score-0.159]

65 5 hours of data that is manually labeled for training and testing purposes. [sent-197, score-0.128]

66 For each chunk, we compute 651 features, which include signal energy in log and linear frequency bands, autocorrelation, different entropy measures, means, variances, etc. [sent-200, score-0.068]
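As an illustration of this kind of per-chunk feature computation (the paper does not list its exact feature definitions; the band edges, bin counts, and helper below are invented for the example):

```python
import numpy as np

def chunk_features(sig, n_bands=4, n_bins=16):
    # A few of the feature types named in the text, from one sensor chunk:
    # mean, variance, log-band spectral energies, lag-1 autocorrelation,
    # and a histogram entropy.
    feats = [sig.mean(), sig.var()]
    spec = np.abs(np.fft.rfft(sig)) ** 2
    edges = np.logspace(0, np.log10(len(spec) - 1), n_bands + 1).astype(int)
    for lo, hi in zip(edges[:-1], edges[1:]):     # log-spaced frequency bands
        feats.append(np.log(spec[lo:hi + 1].sum() + 1e-12))
    s = sig - sig.mean()
    feats.append((s[:-1] * s[1:]).sum() / ((s * s).sum() + 1e-12))  # lag-1 autocorr
    hist, _ = np.histogram(sig, bins=n_bins)
    pr = hist[hist > 0] / hist.sum()
    feats.append(-(pr * np.log(pr)).sum())        # empirical entropy
    return np.array(feats)

print(chunk_features(np.random.randn(256)))
```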

67 The features are chosen based on what is used in existing activity recognition literature and a few additional ones that we felt could be useful. [sent-201, score-0.118]

68 We recorded 15 hours of sensor traces over 12 days. [sent-206, score-0.067]

69 As this set contains longer time-scale activities, the data is segmented into 1-minute chunks and 321 different features are computed, similar to the first dataset. [sent-207, score-0.062]

70 We evaluate the performance of supervised and semi-supervised training algorithms on these two datasets. [sent-210, score-0.071]

71 For the semi-supervised case, we randomly select 40% of the sequences for a given person or a given day as labeled and a different subset as the unlabeled training data. [sent-211, score-0.24]

72 We compare the performance of sML and sVEB as we incorporate more unlabeled data (20%, 40% and 60%) into the training process. [sent-212, score-0.157]

73 We also compare the supervised techniques, ML, ML+Boost, and VEB, with increasing amount of labeled data. [sent-213, score-0.091]

74 We perform leave-one-person-out cross-validation on dataset 1 and leave-one-day-out cross-validation on dataset 2 and report the average accuracies. [sent-216, score-0.062]

75 The number of features selected (through the boosting iterations) is set to 50 for both datasets – including more features did not significantly improve the classification performance. [sent-219, score-0.145]

76 For both datasets, incorporating more unlabeled data improves accuracy. [sent-220, score-0.126]

77 Parameter estimation and feature selection via sVEB consistently results in the highest accuracy. [sent-223, score-0.069]

78 The (sML+Boost) method performs better than sML but does not perform as well as when feature selection and parameter estimation are done within a unified framework, as in sVEB. [sent-224, score-0.069]

79 [Table 2: Average accuracy (%) ± 95% confidence interval of the semi-supervised algorithms on activity datasets 1 and 2; columns: Unlabeled (20%, 40%, 60%), sML+all obs, sML+Boost, sVEB for each dataset.] [sent-226, score-0.264]

80 [Table 3: Average accuracy (%) ± 95% confidence interval on activity datasets 1 and 2; columns: Labeled (5%, 20%), ML+all obs, ML+Boost, VEB for each dataset.] [sent-262, score-0.205]

81 The results of the supervised learning algorithms are presented in Table 1. [sent-286, score-0.097]

82 The accuracy increases if we incorporate more labeled data during training. [sent-288, score-0.093]

83 To evaluate sVEB when a small amount of labeled data is available, we performed another set of experiments on datasets 1 and 2, where only 5% and 20% of the training data are labeled, respectively. [sent-289, score-0.204]

84 We used all the available unlabeled data during training. [sent-290, score-0.112]

85 These experiments clearly demonstrate that although adding more unlabeled data is not as helpful as incorporating more labeled data, the use of cheap unlabeled data along with feature selection can significantly boost the performance of the models. [sent-292, score-0.458]

86 For each training iteration in sML the cost of running BP is $O(c_l n s^2 + c_u n^2 s^3)$ [10], whereas the cost of each boosting iteration in sVEB is $O\big((c_l + c_u)\,n s^2\big)$. [sent-295, score-0.216]
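To make the gap concrete, a toy evaluation of the two per-iteration cost expressions with invented problem sizes (the constants are illustrative only):

```python
def sml_cost(c_l, c_u, n, s):
    # Per-iteration BP cost inside sML training, as stated above [10].
    return c_l * n * s ** 2 + c_u * n ** 2 * s ** 3

def sveb_cost(c_l, c_u, n, s):
    # Per-boosting-iteration cost of sVEB, as stated above.
    return (c_l + c_u) * n * s ** 2

# Hypothetical sizes: 30 labeled and 30 unlabeled sequences of
# length 1000 with 8 states.
print(sml_cost(30, 30, 1000, 8) / sveb_cost(30, 30, 1000, 8))
```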

87 An efficient entropy gradient computation is proposed in [17], which reduces the cost of sML to $O\big((c_l + c_u)\,n s^2\big)$ but still requires an optimizer to maximize the log-likelihood. [sent-296, score-0.104]

88 Moreover, the number of training iterations needed is usually much higher than the number of boosting iterations because optimizers such as L-BFGS require many more iterations to reach convergence in high-dimensional spaces. [sent-297, score-0.141]

89 Table 4 shows the time for performing the experiments on the activity datasets (as described in the previous section). [sent-299, score-0.097]

90 VEB and sVEB have a lower space cost of $O(n s^2 D_b)$; because of the feature selection step, usually $D_b \ll D$. [sent-302, score-0.084]

91 [Table 4 residue – notation: $n$ = length of training sequence, $c_l$ = number of labeled training sequences, $c_u$ = number of unlabeled training sequences, $s$ = number of states, $D$/$D_b$ = number of features before/after selection; the table reports training time (hours) for ML, ML+Boost, VEB, sML, sML+Boost, and sVEB on datasets 1 and 2.] [sent-304, score-0.401]

92 We presented sVEB, a new semi-supervised training method for CRFs that can simultaneously select discriminative features via a modified LogitBoost and utilize unlabeled data via minimum-entropy regularization. [sent-311, score-0.19]

93 Our experimental results demonstrate that sVEB significantly outperforms other training techniques in real-world activity recognition problems. [sent-312, score-0.144]

94 The unified framework for feature selection and semi-supervised training presented in this paper reduces the computational and human labeling costs, which are often the major bottlenecks in building large classification systems. [sent-313, score-0.157]

95 Text classification from labeled and unlabeled documents using EM. [sent-355, score-0.177]

96 Efficient computation of entropy gradient for semi-supervised conditional random fields. [sent-421, score-0.083]

97 (Appendix) In this section, we show how we derived the equations for $w_i$ and $z_i$ (eq. 8): [sent-423, score-0.168]

98 $$L_F = L_{\mathrm{sVEB}} = L_{\mathrm{VEB}} - \alpha H_{\mathrm{emp}} = \sum_{i=1}^{N} \log p(y_i \mid \mathrm{ve}_i) \;+\; \alpha \sum_{i=N+1}^{M} \sum_{y_i} p(y_i \mid \mathrm{ve}_i)\,\log p(y_i \mid \mathrm{ve}_i)$$
As in LogitBoost, the likelihood function $L_F$ is maximized by learning an ensemble of weak learners. [sent-424, score-0.086]

99 We start with an empty ensemble $F = 0$ and iteratively add the next best weak learner $f_t$ by computing the Newton update $s/H$, where $s$ and $H$ are the first and second derivatives, respectively, of $L_F$ with respect to $f(\mathrm{ve}_i, y_i)$. [sent-425, score-0.307]

100 The resulting boosting weights of eq. (8) are
$$w_i = \begin{cases} p(y_i \mid \mathrm{ve}_i)\,\big(1 - p(y_i \mid \mathrm{ve}_i)\big) & \text{if } 1 \le i \le N \\ \alpha^2\,\big(1 - p(y_i \mid \mathrm{ve}_i)\big)\Big[p(y_i \mid \mathrm{ve}_i)\big(1 - p(y_i \mid \mathrm{ve}_i)\big) + \log p(y_i \mid \mathrm{ve}_i)\Big] & \text{if } N < i \le M \end{cases}$$
At iteration $t$ we get the best weak learner $f_t$ by solving the WLSE problem in eq. (7). [sent-430, score-0.137]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vei', 0.684), ('sveb', 0.404), ('veb', 0.301), ('sml', 0.247), ('yi', 0.175), ('crfs', 0.138), ('virtual', 0.113), ('unlabeled', 0.112), ('wi', 0.094), ('ml', 0.092), ('crf', 0.09), ('ft', 0.089), ('boost', 0.086), ('boosting', 0.083), ('zi', 0.074), ('activity', 0.068), ('labeled', 0.065), ('nki', 0.062), ('wlse', 0.062), ('eb', 0.054), ('obs', 0.054), ('evidence', 0.053), ('mpl', 0.052), ('entropy', 0.047), ('training', 0.045), ('fk', 0.043), ('jiao', 0.041), ('lsv', 0.041), ('lf', 0.038), ('feature', 0.037), ('logitboost', 0.036), ('conditional', 0.036), ('features', 0.033), ('selection', 0.032), ('byi', 0.031), ('wearable', 0.031), ('dataset', 0.031), ('labeling', 0.03), ('bp', 0.03), ('traces', 0.03), ('cu', 0.029), ('datasets', 0.029), ('soft', 0.028), ('accuracy', 0.028), ('weak', 0.027), ('supervised', 0.026), ('elds', 0.025), ('cl', 0.024), ('observations', 0.023), ('liao', 0.023), ('learner', 0.022), ('log', 0.021), ('brfs', 0.021), ('choudhury', 0.021), ('objective', 0.02), ('tuning', 0.02), ('db', 0.02), ('xi', 0.02), ('regularizer', 0.019), ('sensor', 0.019), ('messages', 0.019), ('please', 0.018), ('hours', 0.018), ('sequences', 0.018), ('evidences', 0.018), ('lv', 0.018), ('yu', 0.018), ('neighboring', 0.018), ('recognition', 0.017), ('likelihood', 0.017), ('belief', 0.017), ('synthetic', 0.017), ('activities', 0.016), ('mccallum', 0.016), ('propagation', 0.016), ('iteratively', 0.016), ('xu', 0.015), ('chunks', 0.015), ('grandvalet', 0.015), ('dietterich', 0.015), ('cost', 0.015), ('outperforms', 0.014), ('dence', 0.014), ('scalable', 0.014), ('incorporating', 0.014), ('fed', 0.014), ('segmented', 0.014), ('yl', 0.014), ('indicator', 0.014), ('combines', 0.014), ('weights', 0.014), ('equation', 0.014), ('nodes', 0.014), ('uni', 0.013), ('lee', 0.013), ('optimizer', 0.013), ('optimizers', 0.013), ('human', 0.013), ('ve', 0.013), ('complexity', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. 1

2 0.14561634 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1

3 0.104872 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

Author: Bill Triggs, Jakob J. Verbeek

Abstract: Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labeling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labeling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labeling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labeled datasets which can be costly and troublesome to produce. We introduce a method for learning CRFs from datasets with many unlabeled nodes by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that effective models can be learned from fragmentary labelings and that incorporating top-down aggregate features significantly improves the segmentations. The resulting segmentations are compared to the state-of-the-art on three different image datasets. 1

4 0.084085897 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

5 0.072294176 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. 1

6 0.069366306 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

7 0.066632204 32 nips-2007-Bayesian Co-Training

8 0.061244395 187 nips-2007-Structured Learning with Approximate Inference

9 0.059561186 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

10 0.057953861 62 nips-2007-Convex Learning with Invariances

11 0.057824481 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

12 0.057397924 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

13 0.05499718 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

14 0.053364355 175 nips-2007-Semi-Supervised Multitask Learning

15 0.050880454 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

16 0.047645688 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

17 0.047351606 97 nips-2007-Hidden Common Cause Relations in Relational Learning

18 0.044799022 21 nips-2007-Adaptive Online Gradient Descent

19 0.044004194 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

20 0.04320962 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.134), (1, 0.033), (2, -0.075), (3, 0.069), (4, 0.022), (5, 0.053), (6, 0.061), (7, -0.072), (8, 0.113), (9, -0.015), (10, 0.029), (11, 0.014), (12, -0.102), (13, -0.015), (14, 0.082), (15, 0.093), (16, -0.017), (17, 0.006), (18, 0.102), (19, 0.08), (20, 0.02), (21, -0.023), (22, 0.069), (23, -0.011), (24, -0.033), (25, 0.073), (26, 0.007), (27, -0.019), (28, -0.033), (29, -0.073), (30, 0.049), (31, 0.072), (32, 0.042), (33, -0.042), (34, -0.107), (35, -0.058), (36, -0.047), (37, 0.021), (38, 0.05), (39, -0.001), (40, -0.043), (41, 0.104), (42, -0.035), (43, -0.028), (44, -0.143), (45, 0.045), (46, -0.064), (47, 0.091), (48, -0.137), (49, -0.115)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92883211 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. 1

2 0.69432771 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

Author: Bill Triggs, Jakob J. Verbeek

Abstract: Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labeling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labeling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labeling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labeled datasets which can be costly and troublesome to produce. We introduce a method for learning CRFs from datasets with many unlabeled nodes by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that effective models can be learned from fragmentary labelings and that incorporating top-down aggregate features significantly improves the segmentations. The resulting segmentations are compared to the state-of-the-art on three different image datasets. 1

3 0.6595751 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1

4 0.61559844 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin

Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, when, for example, a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

5 0.47510749 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

6 0.46530518 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

7 0.45030621 69 nips-2007-Discriminative Batch Mode Active Learning

8 0.43768442 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

9 0.42549321 187 nips-2007-Structured Learning with Approximate Inference

10 0.42380664 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

11 0.39960942 62 nips-2007-Convex Learning with Invariances

12 0.38817939 32 nips-2007-Bayesian Co-Training

13 0.37968576 97 nips-2007-Hidden Common Cause Relations in Relational Learning

14 0.36759919 175 nips-2007-Semi-Supervised Multitask Learning

15 0.3514182 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

16 0.30837801 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

17 0.3054246 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets

18 0.2896722 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

19 0.28745437 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

20 0.28522721 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.04), (7, 0.29), (13, 0.06), (16, 0.027), (18, 0.019), (19, 0.02), (21, 0.087), (31, 0.025), (34, 0.031), (35, 0.035), (47, 0.065), (83, 0.087), (85, 0.015), (87, 0.018), (88, 0.019), (90, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81737536 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration

Author: Ulrik Beierholm, Ladan Shams, Wei J. Ma, Konrad Koerding

Abstract: Bayesian models of multisensory perception traditionally address the problem of estimating an underlying variable that is assumed to be the cause of the two sensory signals. The brain, however, has to solve a more general problem: it also has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We first compare these models on a formal level. Furthermore, we conduct a psychophysics experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models. Keywords: causal inference, Bayesian methods, visual perception.
2 Models The success of Bayesian models for cue integration has motivated attempts to extend them to situations of large sensory conflict and a consequent low degree of integration. In one of recent studies taking this approach, subjects were presented with concurrent visual flashes and auditory beeps and asked to count both the number of flashes and the number of beeps [11]. The advantage of the experimental paradigm adopted here was that it probed the joint response distribution by requiring a dual report. Human data were accounted for well by a Bayesian model in which the joint prior distribution over visual and auditory number was approximated from the data. In a similar study, subjects were presented with concurrent flashes and taps and asked to count either the flashes or the taps [9, 22]. The Bayesian model proposed by these authors assumed a joint prior distribution with a near-diagonal form. The corresponding generative model assumes that the sensory sources somehow interact with one another. A third experiment modulated the rates of flashes and beeps. The task was to judge either the visual or the auditory modulation rate relative to a standard [23]. The data from this experiment were modeled using a joint prior distribution which is the sum of a near-diagonal prior and a flat background. While all these models are Bayesian in a formal sense, their underlying generative model does not formalize the model selection process that underlies the combination of cues. This makes it necessary to either estimate an empirical prior [11] by fitting it to human behavior or to assume an ad hoc form [22, 23]. However, we believe that such assumptions are not needed. It was shown recently that human judgments of spatial unity in an auditory-visual spatial localization task can be described using a Bayesian inference model that infers causal structure [24, 25]. In this model, the brain does not only estimate a stimulus variable, but also infers the probability that the two stimuli have a common cause. In this paper we compare these different models on a large data set of human position estimates in an auditory-visual task. In this section we first describe the traditional cue integration model, then the recent models based on joint stimulus priors, and finally the causal inference model. To relate to the experiment in the next section, we will use the terminology of auditory-visual spatial localization, but the formalism is very general. 2.1 Traditional cue integration The traditional generative model of cue integration [26] has a single source location s which produces on each trial an internal representation (cue) of visual location, xV and one of auditory location, xA . We assume that the noise processes by which these internal representations are generated are conditionally independent from each other and follow Gaussian distributions. That is, p (xV |s) ∼ N (xV ; s, σV )and p (xA |s) ∼ N (xA ; s, σA ), where N (x; µ, σ) stands for the normal distribution over x with mean µ and standard deviation σ. If on a given trial the internal representations are xV and xA , the probability that their source was s is given by Bayes’ rule, p (s|xV , xA ) ∝ p (xV |s) p (xA |s) . If a subject performs maximum-likelihood estimation, then the estimate will be xV +wA s = wV wV +wA xA , where wV = σ1 and wA = σ1 . It is important to keep in mind that this is the ˆ 2 2 V A estimate on a single trial. 
A psychophysical experimenter can never have access to xV and xA , which 2 are the noisy internal representations. Instead, an experimenter will want to collect estimates over many trials and is interested in the distribution of s given sV and sA , which are the sources generated ˆ by the experimenter. In a typical cue combination experiment, xV and xA are not actually generated by the same source, but by different sources, a visual one sV and an auditory one sA . These sources are chosen close to each other so that the subject can imagine that the resulting cues originate from a single source and thus implicitly have a common cause. The experimentally observed distribution is then p (ˆ|sV , sA ) = s p (ˆ|xV , xA ) p (xV |sV ) p (xA |sA ) dxV dxA s Given that s is a linear combination of two normally distributed variables, it will itself follow a ˆ sV +wA 1 2 normal distribution, with mean s = wVwV +wA sA and variance σs = wV +wA . The reason that we ˆ ˆ emphasize this point is because many authors identify the estimate distribution p (ˆ|sV , sA ) with s the posterior distribution p (s|xV , xA ). This is justified in this case because all distributions are Gaussian and the estimate is a linear combination of cues. However, in the case of causal inference, these conditions are violated and the estimate distribution will in general not be the same as the posterior distribution. 2.2 Models with bisensory stimulus priors Models with bisensory stimulus priors propose the posterior over source positions to be proportional to the product of unimodal likelihoods and a two-dimensional prior: p (sV , sA |xV , xA ) = p (sV , sA ) p (xV |sV ) p (xA |sA ) The traditional cue combination model has p (sV , sA ) = p (sV ) δ (sV − sA ), usually (as above) even with p (sV ) uniform. The question arises what bisensory stimulus prior is appropriate. In [11], the prior is estimated from data, has a large number of parameters, and is therefore limited in its predictive power. In [23], it has the form − (sV −sA )2 p (sV , sA ) ∝ ω + e 2σ 2 coupling while in [22] the additional assumption ω = 0 is made1 . In all three models, the response distribution p (ˆV , sA |sV , sA ) is obtained by idens ˆ tifying it with the posterior distribution p (sV , sA |xV , xA ). This procedure thus implicitly assumes that marginalizing over the latent variables xV and xA is not necessary, which leads to a significant error for non-Gaussian priors. In this paper we correctly deal with these issues and in all cases marginalize over the latent variables. The parametric models used for the coupling between the cues lead to an elegant low-dimensional model of cue integration that allows for estimates of single cues that differ from one another. C C=1 SA S XA 2.3 C=2 XV SV XA XV Causal inference model In the causal inference model [24, 25], we start from the traditional cue integration model but remove the assumption that two signals are caused by the same source. Instead, the number of sources can be one or two and is itself a variable that needs to be inferred from the cues. Figure 1: Generative model of causal inference. 1 This family of Bayesian posterior distributions also includes one used to successfully model cue combination in depth perception [27, 28]. In depth perception, however, there is no notion of segregation as always a single surface is assumed. 3 If there are two sources, they are assumed to be independent. Thus, we use the graphical model depicted in Fig. 1. We denote the number of sources by C. 
The probability distribution over C given internal representations xV and xA is given by Bayes’ rule: p (C|xV , xA ) ∝ p (xV , xA |C) p (C) . In this equation, p (C) is the a priori probability of C. We will denote the probability of a common cause by pcommon , so that p (C = 1) = pcommon and p (C = 2) = 1 − pcommon . The probability of generating xV and xA given C is obtained by inserting a summation over the sources: p (xV , xA |C = 1) = p (xV , xA |s)p (s) ds = p (xV |s) p (xA |s)p (s) ds Here p (s) is a prior for spatial location, which we assume to be distributed as N (s; 0, σP ). Then all three factors in this integral are Gaussians, allowing for an analytic solution: p (xV , xA |C = 1) = 2 2 2 2 2 −xA )2 σP σA √ 2 2 1 2 2 2 2 exp − 1 (xV σ2 σ2 +σ2+xV+σ2+xA σV . 2 σ2 σ2 2π σV σA +σV σP +σA σP V A V P A P For p (xV , xA |C = 2) we realize that xV and xA are independent of each other and thus obtain p (xV , xA |C = 2) = p (xV |sV )p (sV ) dsV p (xA |sA )p (sA ) dsA Again, as all these distributions are assumed to be Gaussian, we obtain an analytic solution, x2 x2 1 1 V A p (xV , xA |C = 2) = exp − 2 σ2 +σ2 + σ2 +σ2 . Now that we have com2 +σ 2 2 +σ 2 p p V A 2π (σV p )(σA p) puted p (C|xV , xA ), the posterior distribution over sources is given by p (si |xV , xA ) = p (si |xV , xA , C) p (C|xV , xA ) C=1,2 where i can be V or A and the posteriors conditioned on C are well-known: p (si |xA , xV , C = 1) = p (xA |si ) p (xV |si ) p (si ) , p (xA |s) p (xV |s) p (s) ds p (si |xA , xV , C = 2) = p (xi |si ) p (si ) p (xi |si ) p (si ) dsi The former is the same as in the case of mandatory integration with a prior, the latter is simply the unimodal posterior in the presence of a prior. Based on the posterior distribution on a given trial, p (si |xV , xA ), an estimate has to be created. For this, we use a sum-squared-error cost func2 2 tion, Cost = p (C = 1|xV , xA ) (ˆ − s) + p (C = 2|xV , xA ) (ˆ − sV or A ) . Then the best s s estimate is the mean of the posterior distribution, for instance for the visual estimation: sV = p (C = 1|xA , xV ) sV,C=1 + p (C = 2|xA , xV ) sV,C=2 ˆ ˆ ˆ where sV,C=1 = ˆ −2 −2 −2 xV σV +xA σA +xP σP −2 −2 −2 σV +σA +σP and sV,C=2 = ˆ −2 −2 xV σV +xP σP . −2 −2 σV +σP If pcommon equals 0 or 1, this estimate reduces to one of the conditioned estimates and is linear in xV and xA . If 0 < pcommon < 1, the estimate is a nonlinear combination of xV and xA , because of the functional form of p (C|xV , xA ). The response distributions, that is the distributions of sV and sA given ˆ ˆ sV and sA over many trials, now cannot be identified with the posterior distribution on a single trial and cannot be computed analytically either. The correct way to obtain the response distribution is to simulate an experiment numerically. Note that the causal inference model above can also be cast in the form of a bisensory stimulus prior by integrating out the latent variable C, with: p (sA , sV ) = p (C = 1) δ (sA − sV ) p (sA ) + p (sA ) p (sV ) p (C = 2) However, in addition to justifying the form of the interaction between the cues, the causal inference model has the advantage of being based on a generative model that well formalizes salient properties of the world, and it thereby also allows to predict judgments of unity. 4 3 Model performance and comparison To examine the performance of the causal inference model and to compare it to previous models, we performed a human psychophysics experiment in which we adopted the same dual-report paradigm as was used in [11]. 
Observers were simultaneously presented with a brief visual and also an auditory stimulus, each of which could originate from one of five locations on an imaginary horizontal line (-10◦ , -5◦ , 0◦ , 5◦ , or 10◦ with respect to the fixation point). Auditory stimuli were 32 ms of white noise filtered through an individually calibrated head related transfer function (HRTF) and presented through a pair of headphones, whereas the visual stimuli were high contrast Gabors on a noisy background presented on a 21-inch CRT monitor. Observers had to report by means of a key press (1-5) the perceived positions of both the visual and the auditory stimulus. Each combination of locations was presented with the same frequency over the course of the experiment. In this way, for each condition, visual and auditory response histograms were obtained. We obtained response distributions for each the three models described above by numeral simulation. On each trial, estimation is followed by a step in which, the key is selected which corresponds to the position closed to the best estimate. The simulated histograms obtained in this way were compared to the measured response frequencies of all subjects by computing the R2 statistic. Auditory response Auditory model Visual response Visual model no vision The parameters in the causal inference model were optimized using fminsearch in MATLAB to maximize R2 . The best combination of parameters yielded an R2 of 0.97. The response frequencies are depicted in Fig. 2. The bisensory prior models also explain most of the variance, with R2 = 0.96 for the Roach model and R2 = 0.91 for the Bresciani model. This shows that it is possible to model cue combination for large disparities well using such models. no audio 1 0 Figure 2: A comparison between subjects’ performance and the causal inference model. The blue line indicates the frequency of subjects responses to visual stimuli, red line is the responses to auditory stimuli. Each set of lines is one set of audio-visual stimulus conditions. Rows of conditions indicate constant visual stimulus, columns is constant audio stimulus. Model predictions is indicated by the red and blue dotted line. 5 3.1 Model comparison To facilitate quantitative comparison with other models, we now fit the parameters of each model2 to individual subject data, maximizing the likelihood of the model, i.e., the probability of the response frequencies under the model. The causal inference model fits human data better than the other models. Compared to the best fit of the causal inference model, the Bresciani model has a maximal log likelihood ratio (base e) of the data of −22 ± 6 (mean ± s.e.m. over subjects), and the Roach model has a maximal log likelihood ratio of the data of −18 ± 6. A causal inference model that maximizes the probability of being correct instead of minimizing the mean squared error has a maximal log likelihood ratio of −18 ± 3. These values are considered decisive evidence in favor of the causal inference model that minimizes the mean squared error (for details, see [25]). The parameter values found in the likelihood optimization of the causal model are as follows: pcommon = 0.28 ± 0.05, σV = 2.14 ± 0.22◦ , σA = 9.2 ± 1.1◦ , σP = 12.3 ± 1.1◦ (mean ± s.e.m. over subjects). We see that there is a relatively low prior probability of a common cause. In this paradigm, auditory localization is considerably less precise than visual localization. Also, there is a weak prior for central locations. 
3.2 Localization bias A useful quantity to gain more insight into the structure of multisensory data is the cross-modal bias. In our experiment, relative auditory bias is defined as the difference between the mean auditory estimate in a given condition and the real auditory position, divided by the difference between the real visual position and the real auditory position in this condition. If the influence of vision on the auditory estimate is strong, then the relative auditory bias will be high (close to one). It is well-known that bias decreases with spatial disparity and our experiment is no exception (solid line in Fig. 3; data were combined between positive and negative disparities). It can easily be shown that a traditional cue integration model would predict a bias equal to σ2 −1 , which would be close to 1 and 1 + σV 2 A independent of disparity, unlike the data. This shows that a mandatory integration model is an insufficient model of multisensory interactions. 45 % Auditory Bias We used the individual subject fittings from above and and averaged the auditory bias values obtained from those fits (i.e. we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5◦ and 20◦ disparity conditions (model-subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5◦ and 20◦ conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042) and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021) and accepted for the causal inference model (> 0.9). 50 40 35 30 25 20 5 10 15 Spatial Disparity (deg.) 20 Figure 3: Auditory bias as a function of spatial disparity. Solid blue line: data. Red: Causal inference model. Green: Model by Roach et al. [23]. Purple: Model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data. The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On 2 The Roach et al. model has four free parameters (ω,σV , σA , σcoupling ), the Bresciani et al. model has three (σV , σA , σcoupling ), and the causal inference model has four (pcommon ,σV , σA , σP ). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how in this model the erroneous identification of posterior with response distribution can be corrected. 6 the contrary, the prior used in the Roach model contains two terms, one term that is independent of the disparity and one term that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model. 4 Discussion We have argued that any model of multisensory perception should account not only for situations of small, but also of large conflict. In these situations, segregation is more likely, in which the two stimuli are not perceived to have the same cause. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. 
We used the individual subject fits from above and averaged the auditory bias values obtained from those fits (i.e., we did not fit the bias data themselves). Fits are shown in Fig. 3 (dashed lines). We applied a paired t-test to the differences between the 5° and 20° disparity conditions (model versus subject comparison). Using a double-sided test, the null hypothesis that the difference between the bias in the 5° and 20° conditions is correctly predicted by each model is rejected for the Bresciani model (p < 0.002) and the Roach model (p < 0.042), and accepted for the causal inference model (p > 0.17). Alternatively, with a single-sided test, the hypothesis is rejected for the Bresciani model (p < 0.001) and the Roach model (p < 0.021), and accepted for the causal inference model (p > 0.9).

Figure 3: Auditory bias (%) as a function of spatial disparity (deg.). Solid blue line: data. Red: causal inference model. Green: model by Roach et al. [23]. Purple: model by Bresciani et al. [22]. Models were optimized on response frequencies (as in Fig. 2), not on the bias data. [Axis tick values from the original plot are omitted here.]

The reason that the Bresciani model fares worst is that its prior distribution does not include a component that corresponds to independent causes. On the contrary, the prior used in the Roach model contains two terms, one that is independent of the disparity and one that decreases with increasing disparity. It is thus functionally somewhat similar to the causal inference model.

Footnote 2: The Roach et al. model has four free parameters (ω, σV, σA, σcoupling), the Bresciani et al. model has three (σV, σA, σcoupling), and the causal inference model has four (pcommon, σV, σA, σP). We do not consider the Shams et al. model here, since it has many more parameters and it is not immediately clear how, in this model, the erroneous identification of the posterior with the response distribution can be corrected.

4 Discussion

We have argued that any model of multisensory perception should account not only for situations of small conflict but also of large conflict. In such situations, segregation, in which the two stimuli are not perceived to have the same cause, becomes more likely. Even when segregation occurs, the two stimuli can still influence each other. We compared three Bayesian models designed to account for situations of large conflict by applying them to auditory-visual spatial localization data. We pointed out a common mistake: for non-Gaussian bisensory priors without mandatory integration, the response distribution can no longer be identified with the posterior distribution. After correct implementation of the three models, we found that the causal inference model is superior to the models with ad hoc bisensory priors. This is expected, as the nervous system actually needs to solve the problem of deciding which stimuli have a common cause and which stimuli are unrelated.

We have seen that multisensory perception is a suitable tool for studying causal inference. However, the causal inference model also has the potential to quantitatively explain a number of other perceptual phenomena, including perceptual grouping and binding, as well as within-modality cue combination [27, 28]. Causal inference is a universal problem: whenever the brain receives multiple pieces of information, it must decide whether they relate to one another or are independent.

As the causal inference model describes how the brain processes probabilistic sensory information, the question arises about the neural basis of these processes. Neural populations can encode probability distributions over stimuli, a type of coding known as probabilistic population coding. Recent work has shown how optimal cue combination assuming a common cause can be implemented in probabilistic population codes through simple linear operations on neural activities [29]. This framework makes essential use of the structure of neural variability and leads to physiological predictions for activity in areas that combine multisensory input, such as the superior colliculus. Computational mechanisms for causal inference are expected to have a neural substrate that generalizes these linear operations on population activities. A neural implementation of the causal inference model would open the door to a complete neural theory of multisensory perception.
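As a toy illustration of the linear-operations idea (not the actual circuit proposed in [29], and with invented tuning parameters): for independent Poisson populations with Gaussian tuning curves, the log likelihood is linear in the spike counts, so simply adding the visual and auditory population activities yields a pattern that encodes the product of the two likelihoods, i.e., reliability-weighted fusion.

```python
import numpy as np

rng = np.random.default_rng(1)
PREF = np.linspace(-20.0, 20.0, 81)   # preferred locations (deg), hypothetical
TUNING_SD = 8.0                       # tuning-curve width (deg), hypothetical

def population_response(stim, gain):
    """Poisson spike counts; the gain plays the role of cue reliability."""
    rates = gain * np.exp(-(PREF - stim) ** 2 / (2 * TUNING_SD ** 2))
    return rng.poisson(rates)

def decode(counts):
    """Center-of-mass readout, approximately ML for dense Gaussian tuning."""
    return (counts * PREF).sum() / counts.sum()

r_v = population_response(5.0, gain=50.0)   # precise visual cue
r_a = population_response(-5.0, gain=5.0)   # imprecise auditory cue

# Adding spike counts multiplies the encoded Poisson likelihoods, so the
# summed population encodes the fused, reliability-weighted estimate.
print(decode(r_v), decode(r_a), decode(r_v + r_a))  # fused value lies near the visual estimate
```

Causal inference additionally requires weighing the common-cause hypothesis against the independent-cause hypothesis, which is no longer a purely linear operation on the activities; hence the expectation above that its substrate generalizes these linear operations.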
References

[1] H.L. Pick, D.H. Warren, and J.C. Hay. Sensory conflict in judgements of spatial direction. Percept Psychophys, 6:203–205, 1969.
[2] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory "compellingness" in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys, 30(6):557–64, 1981.
[3] D. Alais and D. Burr. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3):257–62, 2004.
[4] R. A. Jacobs. Optimal integration of texture and motion cues to depth. Vision Res, 39(21):3621–9, 1999.
[5] R. J. van Beers, A. C. Sittig, and J. J. Gon. Integration of proprioceptive and visual position-information: An experimentally supported model. J Neurophysiol, 81(3):1355–64, 1999.
[6] D. H. Warren and W. T. Cleaves. Visual-proprioceptive interaction under large amounts of conflict. J Exp Psychol, 90(2):206–14, 1971.
[7] C. E. Jack and W. R. Thurlow. Effects of degree of visual association and angle of displacement on the "ventriloquism" effect. Percept Mot Skills, 37(3):967–79, 1973.
[8] G. H. Recanzone. Auditory influences on visual temporal rate perception. J Neurophysiol, 89(2):1078–93, 2003.
[9] J. P. Bresciani, M. O. Ernst, K. Drewing, G. Bouyer, V. Maury, and A. Kheddar. Feeling what you hear: auditory signals can modulate tactile tap perception. Exp Brain Res, 162(2):172–80, 2005.
[10] R. Gepshtein, P. Leiderman, L. Genosar, and D. Huppert. Testing the three step excited state proton transfer model by the effect of an excess proton. J Phys Chem A, 109(42):9674–84, 2005.
[11] L. Shams, W. J. Ma, and U. Beierholm. Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17):1923–7, 2005.
[12] G. Thomas. Experimental study of the influence of vision on sound localisation. J Exp Psychol, 28:167–177, 1941.
[13] W. R. Thurlow and C. E. Jack. Certain determinants of the "ventriloquism effect". Percept Mot Skills, 36(3):1171–84, 1973.
[14] C.S. Choe, R. B. Welch, R.M. Gilford, and J.F. Juola. The "ventriloquist effect": visual dominance or response bias. Percept Psychophys, 18:55–60, 1975.
[15] R. I. Bermant and R. B. Welch. Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition. Percept Mot Skills, 42(43):487–93, 1976.
[16] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory discrepancy. Psychol Bull, 88(3):638–67, 1980.
[17] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys, 29(6):578–84, 1981.
[18] P. Bertelson, F. Pavani, E. Ladavas, J. Vroomen, and B. de Gelder. Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38(12):1634–42, 2000.
[19] D. A. Slutsky and G. H. Recanzone. Temporal and spatial dependency of the ventriloquism effect. Neuroreport, 12(1):7–10, 2001.
[20] J. Lewald, W. H. Ehrenstein, and R. Guski. Spatio-temporal constraints for auditory-visual integration. Behav Brain Res, 121(1-2):69–79, 2001.
[21] M. T. Wallace, G. E. Roberson, W. D. Hairston, B. E. Stein, J. W. Vaughan, and J. A. Schirillo. Unifying multisensory signals across time and space. Exp Brain Res, 158(2):252–8, 2004.
[22] J. P. Bresciani, F. Dammeier, and M. O. Ernst. Vision and touch are automatically integrated for the perception of sequences of events. J Vis, 6(5):554–64, 2006.
[23] N. W. Roach, J. Heron, and P. V. McGraw. Resolving multisensory conflict: a strategy for balancing the costs and benefits of audio-visual integration. Proc Biol Sci, 273(1598):2159–68, 2006.
[24] K. P. Kording and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends Cogn Sci, 2006.
[25] K.P. Kording, U. Beierholm, W.J. Ma, S. Quartz, J. Tenenbaum, and L. Shams. Causal inference in multisensory perception. PLoS ONE, 2(9):e943, 2007.
[26] Z. Ghahramani. Computation and psychophysics of sensorimotor integration. PhD thesis, Massachusetts Institute of Technology, 1995.
[27] D. C. Knill. Mixture models and the probabilistic structure of depth cues. Vision Res, 43(7):831–54, 2003.
[28] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue conflict studies with stereoscopic and figure cues to slant. J Vis, 7(7):2–24, 2007.
[29] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11):1432–8, 2006.

same-paper 2 0.69435793 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. 1

3 0.65197039 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

Author: M. M. Mahmud, Sylvian Ray

Abstract: In transfer learning we aim to solve new problems using fewer examples using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However it is not yet clear how to define relatedness between tasks. This is considered as a major problem as it is conceptually troubling and it makes it unclear how much information to transfer and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the ‘right’ amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other reasonable transfer method can do much better than our Kolmogorov Complexity theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository. 1

4 0.50412893 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

5 0.47356781 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

6 0.47309208 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

7 0.47193897 86 nips-2007-Exponential Family Predictive Representations of State

8 0.47035915 69 nips-2007-Discriminative Batch Mode Active Learning

9 0.46993515 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

10 0.46916136 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

11 0.46849576 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

12 0.46764946 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

13 0.46677747 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

14 0.4663114 63 nips-2007-Convex Relaxations of Latent Variable Training

15 0.4661513 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

16 0.46450585 156 nips-2007-Predictive Matrix-Variate t Models

17 0.46376768 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

18 0.4637014 158 nips-2007-Probabilistic Matrix Factorization

19 0.46299434 100 nips-2007-Hippocampal Contributions to Control: The Third Way

20 0.46271896 24 nips-2007-An Analysis of Inference with the Universum