nips nips2009 nips2009-258 knowledge-graph by maker-knowledge-mining

258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise


Source: pdf

Author: Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, Paul L. Ruvolo

Abstract: Modern machine learning-based approaches to computer vision require very large databases of hand labeled images. Some contemporary vision systems already require on the order of millions of images for training (e.g., Omron face detector [9]). New Internet-based services allow for a large number of labelers to collaborate around the world at very low cost. However, using these services brings interesting theoretical and practical challenges: (1) The labelers may have wide ranging levels of expertise which are unknown a priori, and in some cases may be adversarial; (2) images may vary in their level of difficulty; and (3) multiple labels for the same image must be combined to provide an estimate of the actual label of the image. Probabilistic approaches provide a principled way to approach these problems. In this paper we present a probabilistic model and use it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image. On both simulated and real data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both noisy and adversarial labelers. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 New Internet-based services allow for a large number of labelers to collaborate around the world at very low cost. [sent-7, score-0.592]

2 In this paper we present a probabilistic model and use it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image. [sent-10, score-0.247]

3 On both simulated and real data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both noisy and adversarial labelers. [sent-11, score-0.407]

4 While these approaches use clever schemes to obtain data from humans for free, a more direct approach is to hire labelers online. [sent-17, score-0.572]

5 For example, principled methods are needed to combine the labels from multiple experts and to estimate the certainty of the current labels. [sent-20, score-0.294]

6 Which image should be labeled (or relabeled) next must also be decided – it may be prudent, for example, to collect many labels for each image in order to increase one’s confidence in that image’s label. [sent-21, score-0.506]

7 However, if an image is easy and the labelers of that image are reliable, a few labels may be sufficient and valuable resources may be used to label other images. [sent-22, score-1.136]

8 On both simulated and real-life data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both adversarial and noisy labelers. [sent-26, score-0.407]

9 We wish to determine the class label Zj (0 or 1) of each image j by querying m labelers. [sent-31, score-0.197]

10 The observed labels depend on several causal factors: (1) the difficulty of the image; (2) the expertise of the labeler; and (3) the true label. [sent-32, score-0.383]

11 We model the difficulty of image j using the parameter 1/βj ∈ [0, ∞) where βj is constrained to be positive. [sent-33, score-0.122]

12 Here 1/βj = ∞ means the image is very ambiguous and hence even the most proficient labeler has a 50% chance of labeling it correctly. [sent-34, score-0.518]

13 1/βj = 0 means the image is so easy that even the most obtuse labeler will always label it correctly. [sent-35, score-0.531]

14 The expertise of each labeler i is modeled by the parameter αi ∈ (−∞, +∞). [sent-36, score-0.454]

15 Here α = +∞ means the labeler always labels images correctly; α = −∞ means the labeler always labels images incorrectly, i.e., always gives the opposite of the true label. [sent-37, score-1.304]

16 In this case (αi < 0), the labeler is said to be adversarial. [sent-40, score-0.334]

17 Finally, αi = 0 means that the labeler cannot discriminate between the two classes – his/her labels carry no information about the true image label Zj . [sent-41, score-0.794]

18 Note that we do not require the labelers to be human – labelers can also be, for instance, automatic classifiers. [sent-42, score-1.168]

19 Hence, the proposed approach will provide a principled way of combining labels from any combination of human and previously existing machine-based classifiers. [sent-43, score-0.27]

20 The log-odds of a correct label are given by log [ p(Lij = Zj) / (1 − p(Lij = Zj)) ] = αi βj (2). More skilled labelers (higher αi) have a higher probability of labeling correctly. [sent-46, score-0.658]

21 As the difficulty 1/βj of an image increases, the probability of the label being correct moves toward 0.5. [sent-47, score-0.213]

22 Similarly, as the labeler’s expertise decreases (lower αi ), the chance of correctness likewise drops to 0. [sent-49, score-0.134]
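
A minimal sketch of the correctness probability implied by Equation 2 (illustrative Python, not the authors' code; the function name is ours):

```python
import numpy as np

def p_correct(alpha_i, beta_j):
    # GLAD: log-odds of a correct label = alpha_i * beta_j (Equation 2),
    # so the probability of correctness is a logistic function of the product.
    return 1.0 / (1.0 + np.exp(-alpha_i * beta_j))

# Limiting behavior matches the text:
print(p_correct(3.0, 1e-9))   # ~0.5: very ambiguous image (1/beta -> infinity)
print(p_correct(0.0, 2.0))    # 0.5: labeler carries no information (alpha = 0)
print(p_correct(-2.0, 2.0))   # < 0.5: adversarial labeler (alpha < 0)
```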

23 True image labels Zj , labeler accuracy values αi , and image difficulty values βj are sampled from a known prior distribution. [sent-53, score-0.848]

24 These determine the observed labels according to Equation 1. [sent-54, score-0.24]

25 Given a set of observed labels l = {lij }, the task is to infer simultaneously the most likely values of Z = {Zj } (the true image labels) as well as the labeler accuracies α = {αi } and the image difficulty parameters β = {βj }. [sent-55, score-0.909]

26 Inference: The observed labels are samples from the {Lij} random variables. [sent-57, score-0.24]

27 The unobserved variables are the true image labels Zj , the different labeler accuracies αi , and the image difficulty parameters 1/βj . [sent-58, score-0.859]

29 Figure 1: Graphical model of image difficulties β, true image labels Z, observed labels L, and labeler accuracies α. [sent-77, score-0.616]

30 Here we can use the Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest (the full derivation is in the Supplementary Materials). E step: Let the set of all labels given for image j be denoted lj = {lij : labeler i labeled image j}. [sent-80, score-0.393]

31 Note that not every labeler must label every single image. [sent-81, score-0.409]

32 In this case, the index variable i in lij refers only to those labelers who labeled image j. [sent-82, score-0.908]

33 M step: Q(α, β) is defined using the posterior probabilities of the Z values computed during the last E step: Q(α, β) = E[ln p(l, z | α, β)] = E[ln ∏j ( p(zj) ∏i p(lij | zj, αi, βj) )], since the lij are conditionally independent given zj. [sent-87, score-0.191]

34 These priors may be useful, for example, if we know that most labelers are not adversarial. [sent-93, score-0.572]

35 By “clamping” the Z values (using the prior) for the images on which the true label is known for sure, the model may be able to better estimate the other parameters. [sent-96, score-0.191]

36 The Z values for such images can be clamped by setting the prior probability p(zj ) (used in the E-Step) for these images to be very high towards one particular class. [sent-97, score-0.186]
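
A compact sketch of the resulting EM loop, under the assumptions above (binary labels, logistic likelihood, clampable priors p(zj)). This is illustrative only: the paper's conjugate gradient optimizer (see the footnote later in this summary) is replaced here by scipy's L-BFGS-B with numeric gradients, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def em_glad(triples, n_labelers, n_images, prior1=None, n_iters=20):
    """triples: list of (i, j, l): labeler i gave image j the label l in {0, 1}.
    Not every labeler must label every image. prior1[j] = p(z_j = 1); setting it
    near 0 or 1 "clamps" images whose true label is known for sure."""
    alpha = np.ones(n_labelers)        # labeler abilities
    log_beta = np.zeros(n_images)      # optimize log(beta_j) so beta_j stays positive
    prior = np.full(n_images, 0.5) if prior1 is None else np.asarray(prior1, float)

    for _ in range(n_iters):
        # E step: p(z_j = 1 | l_j); the l_ij are conditionally independent given z_j.
        log_odds = np.log(prior) - np.log(1.0 - prior)
        for i, j, l in triples:
            s = np.clip(expit(alpha[i] * np.exp(log_beta[j])), 1e-12, 1 - 1e-12)
            log_odds[j] += np.log(s / (1 - s)) if l == 1 else np.log((1 - s) / s)
        post = expit(log_odds)

        # M step: maximize Q(alpha, beta) = E[ln p(l, z | alpha, beta)].
        # The E[ln p(z_j)] term is dropped: it does not depend on alpha or beta.
        def neg_q(theta):
            a, lb = theta[:n_labelers], theta[n_labelers:]
            q = 0.0
            for i, j, l in triples:
                s = np.clip(expit(a[i] * np.exp(lb[j])), 1e-12, 1 - 1e-12)
                p1 = s if l == 1 else 1 - s   # p(l_ij | z_j = 1)
                p0 = 1 - p1                   # p(l_ij | z_j = 0)
                q += post[j] * np.log(p1) + (1 - post[j]) * np.log(p0)
            return -q

        res = minimize(neg_q, np.concatenate([alpha, log_beta]), method="L-BFGS-B")
        alpha, log_beta = res.x[:n_labelers], res.x[n_labelers:]

    return post, alpha, np.exp(log_beta)
```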

37 Computing each function is linear in the number of images, the number of labelers, and the total number of image labels. [sent-105, score-0.122]

38 Empirically, when using the approach on a database of 1 million images that we recently collected and labeled, we found that the EM procedure converged in about 10 minutes using a single core of a Xeon 2. [sent-106, score-0.13]

39 Real-time inference may also be possible if we maintain parameters close to the solution and update them as new labels become available. [sent-109, score-0.242]

40 This would allow using the algorithm in an active manner to choose in real-time which images should be labeled next so as to minimize the uncertainty about the image labels. [sent-110, score-0.252]

41 Simulations: Here we explore the performance of the model using a set of image labels generated by the model itself. [sent-111, score-0.347]

42 In particular, we simulated between 4 and 20 labelers, each labeling 2000 images, whose true labels Z were either 0 or 1 with equal probability. [sent-113, score-0.339]

43 The accuracy αi of each labeler was drawn from a normal distribution with mean 1 and variance 1. [sent-114, score-0.379]

44 Given these labeler abilities and image difficulties, the observed labels lij were sampled according to Equation 1 using Z. [sent-116, score-0.9]
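
A sketch of this generative simulation. The N(1, 1) ability prior is stated above; the exponential difficulty prior, the seed, and the names are illustrative assumptions, since the extracted text does not specify the prior on βj:

```python
import numpy as np

rng = np.random.default_rng(0)
n_labelers, n_images = 10, 2000

z = rng.integers(0, 2, size=n_images)          # true labels, 0/1 with equal probability
alpha = rng.normal(1.0, 1.0, size=n_labelers)  # labeler abilities ~ N(1, 1)
beta = rng.exponential(1.0, size=n_images)     # assumed prior on image "easiness"

# Sample observed labels per Equation 1 (here every labeler labels every image).
p_ok = 1.0 / (1.0 + np.exp(-np.outer(alpha, beta)))  # shape: (n_labelers, n_images)
correct = rng.random((n_labelers, n_images)) < p_ok
labels = np.where(correct, z, 1 - z)                 # an incorrect answer flips z
```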

45 As expected, as the number of labelers grows, the parameter estimates converge to the true values. [sent-121, score-0.614]

46 We also computed the proportion of label estimates Ẑ that matched the true image labels Z. [sent-122, score-0.489]

47 We compared the maximum likelihood estimates of the GLAD model to estimates obtained by taking the majority vote as the predicted label. [sent-123, score-0.351]

48 GLAD's label estimates were obtained by thresholding at 0.5 the posterior probability of the label of each image being of class 1, given the accuracy and difficulty parameters returned by EM (see Section 3). [sent-125, score-0.256]

49 GLAD makes fewer errors than the majority vote heuristic. [sent-127, score-0.313]
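
Given the simulated data and the EM sketch above, both decision rules reduce to a threshold; a hedged comparison snippet (reusing `labels` and `z` from the simulation sketch and `em_glad` from the EM sketch; running EM on all 20,000 labels with numeric gradients is slow and purely illustrative):

```python
# Majority Vote: take each image's most common observed label
# (ties break toward 0 here; see the tie discussion later in this summary).
mv_hat = (labels.mean(axis=0) > 0.5).astype(int)

# GLAD: threshold the EM posterior p(z_j = 1 | data) at 0.5.
triples = [(i, j, int(labels[i, j]))
           for i in range(n_labelers) for j in range(n_images)]
post, alpha_hat, beta_hat = em_glad(triples, n_labelers, n_images)
glad_hat = (post > 0.5).astype(int)

print("Majority Vote accuracy:", (mv_hat == z).mean())
print("GLAD accuracy:        ", (glad_hat == z).mean())
```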

50 The difference between the two approaches is particularly pronounced when the number of labelers per image is small. [sent-128, score-0.694]

51 On many images, GLAD correctly infers the true image label Z even when that Z value was the minority opinion. [sent-129, score-0.235]

52 In essence, GLAD is exploiting the fact that some labelers are experts (which it infers automatically), and hence their votes should count more on these images than the votes of less skilled labelers. [sent-130, score-0.776]

53 Modeling Image Difficulty: To explore the importance of estimating image difficulty, we performed a simple simulation: image labels (0 or 1) were assigned randomly (with equal probability) to 1000 images. [sent-131, score-0.347]

54 The proportion of “good” to “bad” labelers is 25:1. [sent-134, score-0.597]

55 The probability of correctness for each image difficulty and labeler quality combination was given by the table below. (Footnote: The libgsl conjugate gradient descent optimizer we used requires both Q and ∇Q.) [sent-135, score-0.47]

56 Figure 2: Left: The accuracies of the GLAD model versus simple voting for inferring the underlying class labels on simulation data (x-axis: number of labelers). [sent-147, score-0.309]

57 We compared three approaches: (1) our proposed method, GLAD; (2) the method proposed in [5], which models labeler ability but not image difficulty; and (3) Majority Vote. [sent-152, score-0.456]

58 Over the 50 simulation runs, the average percent-correct of the inferred labels was 85. [sent-161, score-0.306]

59 Each of the 100 Greeble images was labeled by 10 different human coders on the Turk for gender (male/female). [sent-167, score-0.222]

60 Four greebles of each gender (separate from the 100 labeled images) were given as examples of each class. [sent-168, score-0.126]

61 The images were shown at a resolution of 48x48 pixels, so the task required careful inspection in order to label them correctly. [sent-169, score-0.168]

62 The ground-truth gender values were all known with certainty (since they are rendered objects) and thus provided a means of measuring the accuracy of inferred image labels. [sent-170, score-0.285]

63 Figure 3: Left: Examples of Greebles. [sent-174, score-0.347]

64 Right: Accuracy of the inferred labels on the Greeble images, as a function of the number of labels M obtained for each image, using either GLAD or Majority Vote. [sent-176, score-0.379]

65 We studied the effect of varying the number of labels M obtained from different labelers for each image on the accuracy of the inferred Z. [sent-178, score-0.903]

66 Hence, from the 10 labels total we obtained per Greeble image, we randomly sampled 2 ≤ M ≤ 8 labels over all labelers during each experimental trial. [sent-179, score-1.022]
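
A sketch of this per-trial subsampling protocol (illustrative; it assumes the ten collected labels for each Greeble image are stored as a list keyed by image id):

```python
import random

def subsample(labels_per_image, m, seed=None):
    # For each image, randomly keep m of its available labels
    # (2 <= m <= 8 of the 10 collected), as done on each trial.
    rng = random.Random(seed)
    return {img: rng.sample(lbls, m) for img, lbls in labels_per_image.items()}
```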

67 On each trial we compared the accuracy of labels Z as estimated by GLAD (using a threshold of 0.5 on p(Z)) to labels as estimated by the Majority Vote heuristic. [sent-180, score-0.27] [sent-181, score-0.225]

69 For all values of M we tested, the accuracy of the labels inferred by GLAD is significantly higher than for Majority Vote (p < 0.05). [sent-184, score-0.286]

70 This means that, in order to achieve the same level of accuracy, fewer labels are needed. [sent-186, score-0.225]

71 This may stem from the lack of an optimal decision rule under Majority Vote when as many labelers say an image is Male as say it is Female. [sent-189, score-0.694]

72 Empirical Study II (Duchenne Smiles): As a second experiment, we used the Mechanical Turk to label face images containing smiles as either Duchenne or Non-Duchenne. [sent-191, score-0.25]

73 We obtained Duchenne/Non-Duchenne labels for 160 images from 20 different Mechanical Turk labelers; in total, there were 3572 labels. [sent-195, score-0.318]

74 (Hence, labelers labeled each image a variable number of times.) [sent-196, score-0.731]

75 For ground truth, these images were also labeled by two certified experts in the Facial Action Coding System. [sent-197, score-0.183]

76 Using the labels obtained from the Mechanical Turk, we inferred the image labels using either GLAD or the Majority Vote heuristic, and then compared them to ground truth. [sent-199, score-0.654]

77 Figure 5: Accuracy (percent correct) of inferred Duchenne/Non-Duchenne labels using either GLAD or Majority Vote under (left) noisy labelers or (right) adversarial labelers. [sent-216, score-1.054]

78 As the number of noise/adversarial labels increases, the performance of labels inferred using Majority Vote decreases. [sent-217, score-0.511]

79 Results: Using just the raw labels obtained from the Mechanical Turk, the labels inferred using GLAD matched the ground-truth labels on 78.12% of the images, whereas labels inferred using Majority Vote were only 71. [sent-219, score-0.736] [sent-220, score-0.286]

81 Simulated Noisy and Adversarial Labelers: We also simulated noisy and adversarial labeler conditions. [sent-223, score-0.559]

82 It is to be expected, for example, that in some cases labelers may just try to complete the task in a minimum amount of time, disregarding accuracy. [sent-224, score-0.572]

83 In other cases labelers may misunderstand the instructions, or may be adversarial, thus producing labels that tend to be opposite to the true labels. [sent-225, score-0.836]

84 Robustness to such noisy and adversarial labelers is important, especially as the popularity of Web-based labeling tools increases and the quality of labelers becomes more diverse. [sent-226, score-1.419]

85 To investigate the robustness of the proposed approaches, we generated data from virtual “labelers” whose labels were completely uninformative, i.e., statistically independent of the true image labels. [sent-227, score-0.225]

86 We also added artificial “adversarial” labelers whose labels tended to be the opposite of the true label for each image. [sent-230, score-0.911]

87 The number of noisy labels was varied from 0 to 5000 (in increments of 500), and the number of adversarial labels was varied from 0 to 750 (in increments of 250). [sent-231, score-0.684]
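
A sketch of how such labelers can be injected into a label set (illustrative: `z` is the vector of true labels, and the 0.9 flip rate for adversaries is an assumption; the text says only that adversarial labels tended to be the opposite of the truth):

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_labels(n, n_images):
    # Uninformative labels: independent of the true label (alpha = 0 behavior).
    imgs = rng.integers(0, n_images, size=n)
    return imgs, rng.integers(0, 2, size=n)

def adversarial_labels(n, n_images, z, flip_prob=0.9):
    # Labels that tend to be the opposite of the true label for each image.
    imgs = rng.integers(0, n_images, size=n)
    flip = rng.random(n) < flip_prob
    return imgs, np.where(flip, 1 - z[imgs], z[imgs])
```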

88 For each setting, label inference accuracy was computed for both GLAD and the Majority Vote method. [sent-232, score-0.137]

89 As shown in Figure 5, the accuracy of GLAD-based label inference is much less affected by labeling noise than is Majority Vote. [sent-233, score-0.199]

90 When adversarial labels are introduced, GLAD inferred that some labelers were purposely giving the opposite label and automatically flipped their labels. [sent-234, score-0.415]

91 Related Work: To our knowledge, GLAD is the first model in the literature to simultaneously estimate the true label, item difficulty, and coder expertise in an unsupervised and efficient manner. [sent-236, score-0.2]

92 Snow et al. [14] used a probabilistic model similar to Naive Bayes to show that by averaging multiple naive labelers (≤ 10) one can obtain labels as accurate as those of a few expert labelers. [sent-243, score-0.849]

93 Two key differences between their model and GLAD are that (1) they assume a significant proportion of images have been pre-labeled with ground-truth values, and (2) all the images have equal difficulty. [sent-244, score-0.254]

94 Sheng et al. [12] examine how to identify which images of an image dataset to label again in order to reduce uncertainty in the posterior probabilities of the latent class labels. [sent-246, score-0.319]

95 Smyth et al. [13] used a similar approach to combine labels from multiple experts for items with homogeneous levels of difficulty. [sent-249, score-0.272]

96 Batchelder and Romney [2] infer test answers and test-takers’ abilities simultaneously, but do not estimate item difficulties and do not admit adversarial labelers. [sent-250, score-0.262]

97 Other approaches employ a Bayesian model of the labeling process that considers both variability in labeler accuracies as well as item difficulty (e.g., Item Response Theory models). [sent-251, score-0.191]

98 Algorithms are needed to automatically estimate the reliability of ad-hoc anonymous labelers, the difficulty of the different items in the dataset, and the probability of the true labels given the currently available data. [sent-259, score-0.263]

99 The model can be used seamlessly to combine labels from both human labelers and automatic classifiers. [sent-262, score-0.821]

100 Experiments show that GLAD can recover the true data labels more accurately than the Majority Vote heuristic, and that it is highly robust to both noisy and adversarial labelers. [sent-263, score-0.444]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('labelers', 0.572), ('glad', 0.463), ('labeler', 0.334), ('labels', 0.225), ('vote', 0.188), ('lij', 0.177), ('adversarial', 0.164), ('zj', 0.136), ('majority', 0.125), ('image', 0.122), ('expertise', 0.12), ('culty', 0.103), ('duchenne', 0.095), ('images', 0.093), ('smiles', 0.082), ('turk', 0.076), ('label', 0.075), ('mechanical', 0.072), ('labeling', 0.062), ('inferred', 0.061), ('greeble', 0.06), ('dif', 0.053), ('greebles', 0.048), ('accuracy', 0.045), ('em', 0.041), ('gender', 0.041), ('smile', 0.041), ('item', 0.039), ('labeled', 0.037), ('accuracies', 0.033), ('experts', 0.032), ('culties', 0.032), ('noisy', 0.032), ('becoming', 0.031), ('inferring', 0.031), ('dawid', 0.031), ('simulated', 0.029), ('heuristic', 0.029), ('coders', 0.027), ('infomax', 0.027), ('irt', 0.027), ('movellan', 0.027), ('oculi', 0.027), ('orbicularis', 0.027), ('recaptcha', 0.027), ('skene', 0.027), ('soylent', 0.027), ('abilities', 0.027), ('lj', 0.027), ('proportion', 0.025), ('human', 0.024), ('sheng', 0.024), ('muscle', 0.024), ('skilled', 0.024), ('batchelder', 0.024), ('true', 0.023), ('jacob', 0.022), ('ahn', 0.022), ('certi', 0.022), ('num', 0.022), ('truth', 0.022), ('principled', 0.021), ('ground', 0.021), ('ln', 0.021), ('simulation', 0.02), ('votes', 0.02), ('male', 0.02), ('services', 0.02), ('resources', 0.02), ('expert', 0.02), ('old', 0.02), ('snow', 0.019), ('increments', 0.019), ('alpha', 0.019), ('psychometrika', 0.019), ('estimates', 0.019), ('simultaneously', 0.018), ('facial', 0.018), ('inference', 0.017), ('infer', 0.017), ('tools', 0.017), ('probabilistic', 0.017), ('certainty', 0.016), ('paul', 0.016), ('eyes', 0.016), ('smyth', 0.016), ('correct', 0.016), ('opposite', 0.016), ('answers', 0.015), ('al', 0.015), ('infers', 0.015), ('annotations', 0.015), ('vision', 0.015), ('discriminate', 0.015), ('probable', 0.015), ('automatically', 0.015), ('observed', 0.015), ('bottleneck', 0.014), ('correctness', 0.014), ('posterior', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise


2 0.079384282 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets. 1

3 0.076951563 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

Author: Rob Fergus, Yair Weiss, Antonio Torralba

Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1

4 0.070749044 157 nips-2009-Multi-Label Prediction via Compressed Sensing

Author: John Langford, Tong Zhang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider multi-label prediction problems with large output spaces under the assumption of output sparsity – that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. 1

5 0.065333165 211 nips-2009-Segmenting Scenes by Matching Image Composites

Author: Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

Abstract: In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.

6 0.06351538 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

7 0.06087533 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

8 0.057122074 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

9 0.053318299 96 nips-2009-Filtering Abstract Senses From Image Search Results

10 0.053122912 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

11 0.050798852 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

12 0.050485622 71 nips-2009-Distribution-Calibrated Hierarchical Classification

13 0.046885546 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

14 0.043870896 260 nips-2009-Zero-shot Learning with Semantic Output Codes

15 0.04268096 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

16 0.042416401 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

17 0.042337328 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

18 0.041465018 175 nips-2009-Occlusive Components Analysis

19 0.041136071 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

20 0.040690519 201 nips-2009-Region-based Segmentation and Object Detection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.118), (1, -0.045), (2, -0.072), (3, -0.017), (4, -0.017), (5, 0.045), (6, -0.024), (7, 0.002), (8, 0.029), (9, 0.062), (10, -0.047), (11, 0.039), (12, -0.003), (13, -0.058), (14, 0.02), (15, 0.011), (16, 0.055), (17, 0.009), (18, 0.045), (19, -0.027), (20, 0.049), (21, -0.052), (22, -0.034), (23, 0.042), (24, 0.011), (25, 0.012), (26, -0.071), (27, -0.013), (28, -0.022), (29, -0.03), (30, -0.069), (31, -0.081), (32, -0.063), (33, 0.114), (34, -0.017), (35, 0.052), (36, 0.087), (37, 0.036), (38, 0.011), (39, -0.004), (40, -0.046), (41, -0.025), (42, 0.022), (43, 0.138), (44, -0.129), (45, 0.047), (46, -0.033), (47, -0.169), (48, 0.061), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92584515 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise


2 0.55090821 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

Author: Natasha Singh-miller, Michael Collins

Abstract: We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, P (y|a), when the number of labels y is large and the labels share some underlying structure. We propose a method for learning label embeddings (similar to error-correcting output codes (ECOCs)) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate significant improvements in terms of word error rate (WER) on a lecture recognition task over a state-of-the-art baseline GMM model. 1

3 0.52706051 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets. 1

4 0.52343094 157 nips-2009-Multi-Label Prediction via Compressed Sensing

Author: John Langford, Tong Zhang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider multi-label prediction problems with large output spaces under the assumption of output sparsity – that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. 1

5 0.50210983 93 nips-2009-Fast Image Deconvolution using Hyper-Laplacian Priors

Author: Dilip Krishnan, Rob Fergus

Abstract: The heavy-tailed distribution of gradients in natural scenes have proven effective priors for a range of problems such as denoising, deblurring and super-resolution. α These distributions are well modeled by a hyper-Laplacian p(x) ∝ e−k|x| , typically with 0.5 ≤ α ≤ 0.8. However, the use of sparse distributions makes the problem non-convex and impractically slow to solve for multi-megapixel images. In this paper we describe a deconvolution approach that is several orders of magnitude faster than existing techniques that use hyper-Laplacian priors. We adopt an alternating minimization scheme where one of the two phases is a non-convex problem that is separable over pixels. This per-pixel sub-problem may be solved with a lookup table (LUT). Alternatively, for two specific values of α, 1/2 and 2/3 an analytic solution can be found, by finding the roots of a cubic and quartic polynomial, respectively. Our approach (using either LUTs or analytic formulae) is able to deconvolve a 1 megapixel image in less than ∼3 seconds, achieving comparable quality to existing methods such as iteratively reweighted least squares (IRLS) that take ∼20 minutes. Furthermore, our method is quite general and can easily be extended to related image processing problems, beyond the deconvolution application demonstrated. 1

6 0.47691864 71 nips-2009-Distribution-Calibrated Hierarchical Classification

7 0.47520378 149 nips-2009-Maximin affinity learning of image segmentation

8 0.46862048 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

9 0.467906 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

10 0.4569459 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

11 0.45014247 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

12 0.42119101 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

13 0.41615 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

14 0.41490522 96 nips-2009-Filtering Abstract Senses From Image Search Results

15 0.41404855 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

16 0.4135156 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

17 0.40872851 260 nips-2009-Zero-shot Learning with Semantic Output Codes

18 0.39287901 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

19 0.38632187 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

20 0.3814134 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.025), (25, 0.527), (35, 0.048), (36, 0.079), (39, 0.045), (58, 0.052), (71, 0.03), (81, 0.016), (86, 0.055), (91, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99517405 152 nips-2009-Measuring model complexity with the prior predictive

Author: Wolf Vanpaemel

Abstract: In the last few decades, model complexity has received a lot of press. While many methods have been proposed that jointly measure a model’s descriptive adequacy and its complexity, few measures exist that measure complexity in itself. Moreover, existing measures ignore the parameter prior, which is an inherent part of the model and affects the complexity. This paper presents a stand alone measure for model complexity, that takes the number of parameters, the functional form, the range of the parameters and the parameter prior into account. This Prior Predictive Complexity (PPC) is an intuitive and easy to compute measure. It starts from the observation that model complexity is the property of the model that enables it to fit a wide range of outcomes. The PPC then measures how wide this range exactly is. keywords: Model Selection & Structure Learning; Model Comparison Methods; Perception The recent revolution in model selection methods in the cognitive sciences was driven to a large extent by the observation that computational models can differ in their complexity. Differences in complexity put models on unequal footing when their ability to approximate empirical data is assessed. Therefore, models should be penalized for their complexity when their adequacy is measured. The balance between descriptive adequacy and complexity has been termed generalizability [1, 2]. Much attention has been devoted to developing, advocating, and comparing different measures of generalizability (for a recent overview, see [3]). In contrast, measures of complexity have received relatively little attention. The aim of the current paper is to propose and illustrate a stand alone measure of model complexity, called the Prior Predictive Complexity (PPC). The PPC is based on the intuitive idea that a complex model can predict many outcomes and a simple model can predict a few outcomes only. First, I discuss existing approaches to measuring model complexity and note some of their limitations. In particular, I argue that currently existing measures ignore one important aspect of a model: the prior distribution it assumes over the parameters. I then introduce the PPC, which, unlike the existing measures, is sensitive to the parameter prior. Next, the PPC is illustrated by calculating the complexities of two popular models of information integration. 1 Previous approaches to measuring model complexity A first approach to assess the (relative) complexity of models relies on simulated data. Simulationbased methods differ in how these artificial data are generated. A first, atheoretical approach uses random data [4, 5]. In the semi-theoretical approach, the data are generated from some theoretically ∗ I am grateful to Michael Lee and Liz Bonawitz. 1 interesting functions, such as the exponential or the logistic function [4]. Using these approaches, the models under consideration are equally complex if each model provides the best optimal fit to roughly the same number of data sets. A final approach to generating artificial data is a theoretical one, in which the data are generated from the models of interest themselves [6, 7]. The parameter sets used in the generation can either be hand-picked by the researcher, estimated from empirical data or drawn from a previously specified distribution. If the models under consideration are equally complex, each model should provide the best optimal fit to self-generated data more often than the other models under consideration do. 
One problem with this simulation-based approach is that it is very labor intensive. It requires generating a large amount of artificial data sets, and fitting the models to all these data sets. Further, it relies on choices that are often made in an arbitrary fashion that nonetheless bias the results. For example, in the semi-theoretical approach, a crucial choice is which functions to use. Similarly, in the theoretical approach, results are heavily influenced by the parameter values used in generating the data. If they are fixed, on what basis? If they are estimated from empirical data, from which data? If they are drawn randomly, from which distribution? Further, a simulation study only gives a rough idea of complexity differences but provides no direct measure reflecting the complexity. A number of proposals have been made to measure model complexity more directly. Consider a model M with k parameters, summarized in the parameter vector θ = (θ1 , θ2 , . . . , θk , ) which has a range indicated by Ω. Let d denote the data and p(d|θ, M ) the likelihood. The most straightforward measure of model complexity is the parametric complexity (PC), which simply counts the number of parameters: PC = k. (1) PC is attractive as a measure of model complexity since it is very easy to calculate. Further, it has a direct and well understood relation toward complexity: the more parameters, the more complex the model. It is included as the complexity term of several generalizability measures such as AIC [8] and BIC [9], and it is at the heart of the Likelihood Ratio Test. Despite this intuitive appeal, PC is not free from problems. One problem with PC is that it reflects only a single aspect of complexity. Also the parameter range and the functional form (the way the parameters are combined in the model equation) influence a model’s complexity, but these dimensions of complexity are ignored in PC [2, 6]. A complexity measure that takes these three dimensions into account is provided by the geometric complexity (GC) measure, which is inspired by differential geometry [10]. In GC, complexity is conceptualized as the number of distinguishable probability distributions a model can generate. It is defined by GC = k n ln + ln 2 2π det I(θ|M )dθ, (2) Ω where n indicates the size of the data sample and I(θ) is the Fisher Information Matrix: Iij (θ|M ) = −Eθ ∂ 2 ln p(d|θ, M ) . ∂θi ∂θj (3) Note that I(θ|M ) is determined by the likelihood function p(d|θ, M ), which is in turn determined by the model equation. Hence GC is sensitive to the number of parameters (through k), the functional form (through I), and the range (through Ω). Quite surprisingly, GC turns out to be equal to the complexity term used in one version of Minimum Description Length (MDL), a measure of generalizability developed within the domain of information theory [2, 11, 12, 13]. GC contrasts favorably with PC, in the sense that it takes three dimensions of complexity into account rather than a single one. A major drawback of GC is that, unlike PC, it requires considerable technical sophistication to be computed, as it relies on the second derivative of the likelihood. A more important limitation of both PC and GC is that these measures are insensitive to yet another important dimension contributing to model complexity: the prior distribution over the model parameters. The relation between the parameter prior distribution and model complexity is discussed next. 
2 2 Model complexity and the parameter prior The growing popularity of Bayesian methods in psychology has not only raised awareness that model complexity should be taken into account when testing models [6], it has also drawn attention to the fact that in many occasions, relevant prior information is available [14]. In Bayesian methods, there is room to incorporate this information in two different flavors: as a prior distribution over the models, or as a prior distribution over the parameters. Specifying a model prior is a daunting task, so almost invariably, the model prior is taken to be uniform (but see [15] for an exception). In contrast, information regarding the parameter is much easier to include, although still challenging (e.g., [16]). There are two ways to formalize prior information about a model’s parameters: using the parameter prior range (often referred to as simply the range) and using the parameter prior distribution (often referred to as simply the prior). The prior range indicates which parameter values are allowed and which are forbidden. The prior distribution indicates which parameter values are likely and which are unlikely. Models that share the same equation and the same range but differ in the prior distribution can be considered different models (or at least different model versions), just like models that share the same equation but differ in range are different model versions. Like the parameter prior range, the parameter prior distribution influences the model complexity. In general, a model with a vague parameter prior distribution is more complex than a model with a sharply peaked parameter prior distribution, much as a model with a broad-ranged parameter is more complex than the same model where the parameter is heavily restricted. To drive home the point that the parameter prior should be considered when model complexity is assessed, consider the following “fair coin” model Mf and a “biased coin” model Mb . There is a clear intuitive complexity difference between these models: Mb is more complex than Mf . The most straightforward way to formalize these models is as follows, where ph denotes the probability of observing heads: ph = 1/2, (4) ph = θ 0≤θ≤1 p(θ) = 1, (5) for model Mf and the triplet of equations jointly define model Mb . The range forbids values smaller than 0 or greater than 1 because ph is a proportion. As Mf and Mb have a different number of parameters, both PC and GC, being sensitive to the number of parameters, pick up the difference in model complexity between the models. Alternatively, model Mf could be defined as follows: ph = θ 0≤θ≤1 1 p(θ) = δ(θ − ), 2 (6) where δ(x) is the Dirac delta. Note that the model formalized in Equation 6 is exactly identical the model formalized in Equation 4. However, relying on the formulation of model Mf in Equation 6, PC and GC now judge Mf and Mb to be equally complex: both models share the same model equation (which implies they have the same number of parameters and the same functional form) and the same range for the parameter. Hence, PC and GC make an incorrect judgement of the complexity difference between both models. This misjudgement is a direct result of the insensitivity of these measures to the parameter prior. As models Mf and Mb have different prior distributions over their parameter, a measure sensitive to the prior would pick up the complexity difference between these models. Such a measure is introduced next. 
3 The Prior Predictive Complexity Model complexity refers to the property of the model that enables it to predict a wide range of data patterns [2]. The idea of the PPC is to measure how wide this range exactly is. A complex model 3 can predict many outcomes, and a simple model can predict a few outcomes only. Model simplicity, then, refers to the property of placing restrictions on the possible outcomes: the greater restrictions, the greater the simplicity. To understand how model complexity is measured in the PPC, it is useful to think about the universal interval (UI) and the predicted interval (PI). The universal interval is the range of outcomes that could potentially be observed, irrespective of any model. For example, in an experiment with n binomial trials, it is impossible to observe less that zero successes, or more than n successes, so the range of possible outcomes is [0, n] . Similarly, the universal interval for a proportion is [0, 1]. The predicted interval is the interval containing all outcomes the model predicts. An intuitive way to gauge model complexity is then the cardinality of the predicted interval, relative to the cardinality of the universal interval, averaged over all m conditions or stimuli: PPC = 1 m m i=1 |PIi | . |UIi | (7) A key aspect of the PPC is deriving the predicted interval. For a parameterized likelihood-based model, prediction takes the form of a distribution over all possible outcomes for some future, yet-tobe-observed data d under some model M . This distribution is called the prior predictive distribution (ppd) and can be calculated using the law of total probability: p(d|M ) = p(d|θ, M )p(θ|M )dθ. (8) Ω Predicting the probability of unseen future data d arising under the assumption that model M is true involves integrating the probability of the data for each of the possible parameter values, p(d|θ, M ), as weighted by the prior probability of each of these values, p(θ|M ). Note that the ppd relies on the number of parameters (through the number of integrals and the likelihood), the model equation (through the likelihood), and the parameter range (through Ω). Therefore, as GC, the PPC is sensitive to all these aspects. In contrast to GC, however, the ppd, and hence the PPC, also relies on the parameter prior. Since predictions are made probabilistically, virtually all outcomes will be assigned some prior weight. This implies that, in principle, the predicted interval equals the universal interval. However, for some outcomes the assigned weight will be extremely small. Therefore, it seems reasonable to restrict the predicted interval to the smallest interval that includes some predetermined amount of the prior mass. For example, the 95% predictive interval is defined by those outcomes with the highest prior mass that together make up 95% of the prior mass. Analytical solutions to the integral defining the ppd are rarely available. Instead, one should rely on approximations to the ppd by drawing samples from it. In the current study, sampling was performed using WinBUGS [17, 18], a highly versatile, user friendly, and freely available software package. It contains sophisticated and relatively general-purpose Markov Chain Monte Carlo (MCMC) algorithms to sample from any distribution of interest. 
4 An application example The PPC is illustrated by comparing the complexity of two popular models of information integration, which attempt to account for how people merge potentially ambiguous or conflicting information from various sensorial sources to create subjective experience. These models either assume that the sources of information are combined additively (the Linear Integration Model; LIM; [19]) or multiplicatively (the Fuzzy Logical Model of Perception; FLMP; [20, 21]). 4.1 Information integration tasks A typical information integration task exposes participants simultaneously to different sources of information and requires this combined experience to be identified in a forced-choice identification task. The presented stimuli are generated from a factorial manipulation of the sources of information by systematically varying the ambiguity of each of the sources. The relevant empirical data consist 4 of, for each of the presented stimuli, the counts km of the number of times the mth stimulus was identified as one of the response alternatives, out of the tm trials on which it was presented. For example, an experiment in phonemic identification could involve two phonemes to be identified, /ba/ and /da/ and two sources of information, auditory and visual. Stimuli are created by crossing different levels of audible speech, varying between /ba/ and /da/, with different levels of visible speech, also varying between these alternatives. The resulting set of stimuli spans a continuum between the two syllables. The participant is then asked to listen and to watch the speaker, and based on this combined audiovisual experience, to identify the syllable as being either /ba/ or /da/. In the so-called expanded factorial design, not only bimodal stimuli (containing both auditory and visual information) but also unimodal stimuli (providing only a single source of information) are presented. 4.2 Information integration models In what follows, the formal description of the LIM and the FLMP is outlined for a design with two response alternatives (/da/ or /ba/) and two sources (auditory and visual), with I and J levels, respectively. In such a two-choice identification task, the counts km follow a Binomial distribution: km ∼ Binomial(pm , tm ), (9) where pm indicates the probability that the mth stimulus is identified as /da/. 4.2.1 Model equation The probability for the stimulus constructed with the ith level of the first source and the jth level of the second being identified as /da/ is computed according to the choice rule: pij = s (ij, /da/) , s (ij, /da/) + s (ij, /ba/) (10) where s (ij, /da/) represents the overall degree of support for the stimulus to be /da/. The sources of information are assumed to be evaluated independently, implying that different parameters are used for the different modalities. In the present example, the degree of auditory support for /da/ is denoted by ai (i = 1, . . . , I) and the degree of visual support for /da/ by bj (j = 1, . . . , J). When a unimodal stimulus is presented, the overall degree of support for each alternative is given by s (i∗, /da/) = ai and s (∗j, /da/) = bj , where the asterisk (*) indicates the absence of information, implying that Equation 10 reduces to pi∗ = ai and p∗j = bj . (11) When a bimodal stimulus is presented, the overall degree of support for each alternative is based on the integration or blending of both these sources. Hence, for bimodal stimuli, s (ij, /da/) = ai bj , where the operator denotes the combination of both sources. 
Hence, Equation 10 reduces to ai bj . (12) pij = ai bj + (1 − ai ) (1 − bj ) = +, so Equation 12 becomes The LIM assumes an additive combination, i.e., pij = ai + bj . 2 (13) The FLMP, in contrast, assumes a multiplicative combination, i.e., = ×, so Equation 12 becomes ai bj . ai bj + (1 − ai )(1 − bj ) (14) pij = 5 4.2.2 Parameter prior range and distribution Each level of auditory and visual support for /da/ (i.e., ai and bj , respectively) is associated with a free parameter, which implies that the FLMP and the LIM have an equal number of free parameters, I + J. Each of these parameters is constrained to satisfy 0 ≤ ai , bj ≤ 1. The original formulations of the LIM and FLMP unfortunately left the parameter priors unspecified. However, an implicit assumption that has been commonly used is a uniform prior for each of the parameters. This assumption implicitly underlies classical and widely adopted methods for model evaluation using accounted percentage of variance or maximum likelihood. ai ∼ Uniform(0, 1) and bi ∼ Uniform(0, 1) for i = 1, . . . , I; j = 1, . . . , J. (15) The models relying on this set of uniform priors will be referred to as LIMu and FLMPu . Note that LIMu and FLMPu treat the different parameters as independent. This approach misses important information. In particular, the experimental design is such that the amount of support for each level i + 1 is always higher than for level i. Because parameter ai (or bi ) corresponds to the degree of auditory (or visual) support for a unimodal stimulus at the ith level, it seems reasonable to expect the following orderings among the parameters to hold (see also [6]): aj > ai and bj > bi for j > i. (16) The models relying on this set of ordered priors will be referred to as LIMo and FLMPo . 4.3 Complexity and experimental design It is tempting to consider model complexity as an inherent characteristic of a model. For some models and for some measures of complexity this is clearly the case. Consider, for example, model Mb . In any experimental design (i.e., a number of coin tosses), PCMb = 1. However, more generally, this is not the case. Focusing on the FLMP and the LIM, it is clear that even a simple measure as PC depends crucially on (some aspects of) the experimental design. In particular, every level corresponds to a new parameter, so PC = I + J . Similarly, GC is dependent on design choices. The PPC is not different in this respect. The design sensitivity implies that one can only make sensible conclusions about differences in model complexity by using different designs. In an information integration task, the design decisions include the type of design (expanded or not), the number of sources, the number of response alternatives, the number of levels for each source, and the number of observations for each stimulus (sample size). The present study focuses on the expanded factorial designs with two sources and two response alternatives. The additional design features were varied: both a 5 × 5 and a 8 × 2 design were considered, using three different sample sizes (20, 60 and 150, following [2]). 4.4 Results Figure 1 shows the 99% predicted interval in the 8×2 design with n = 150. Each panel corresponds to a different model. In each panel, each of the 26 stimuli is displayed on the x-axis. The first eight stimuli correspond to the stimuli with the lowest level of visual support, and are ordered in increasing order of auditory support. The next eight stimuli correspond to the stimuli with the highest level of visual support. 
The next eight stimuli correspond to the unimodal stimuli where only auditory information is provided (again ranked in increasing order). The final two stimuli are the unimodal visual stimuli. Panel A shows that the predicted interval of LIMu nearly equals the universal interval, ranging between 0 and 1. This indicates that almost all outcomes are given a non-negligible prior mass by LIMu , making it almost maximally complex. FLMPu is even more complex. The predicted interval, shown in Panel B, virtually equals the universal interval, indicating that the model predicts virtually every possible outcome. Panels C and D show the dramatic effect of incorporating relevant prior information in the models. The predicted intervals of both LIMo and FLMPo are much smaller than their counterparts using the uniform priors. Focusing on the comparison between LIM and FLMP, the PPC indicates that the latter is more complex than the former. This observation holds irrespective of the model version (assuming uniform 6 0.9 0.8 0.8 Proportion of /da/ responses 1 0.9 Proportion of /da/ responses 1 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 11 21 A 1* 0 *1 11 21 B 1* *1 1* *1 0.8 Proportion of /da/ responses 0.9 0.8 21 1 0.9 Proportion of /da/ responses 1 11 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 11 21 C 1* 0 *1 D Figure 1: The 99% predicted interval for each of the 26 stimuli (x-axis) according to LIMu (Panel A), FLMPu (Panel B), LIMo (Panel C), and FLMPo (Panel D). Table 1: PPC, based on the 99% predicted interval, for four models across six different designs. 20 LIMu FLMPu LIMo FLMPo 5×5 60 150 20 8×2 60 150 0.97 1 0.75 0.83 0.94 1 0.67 0.80 .97 1 0.77 0.86 0.95 1 0.69 0.82 0.93 0.99 0.64 0.78 7 0.94 0.99 0.66 0.81 vs. ordered priors). The smaller complexity of LIM is in line with previous attempts to measure the relative complexities of LIM and FLMP, such as the atheoretical simulation-based approach ([4] but see [5]), the semi-theoretical simulation-based approach [4], the theoretical simulation-based approach [2, 6, 22], and a direct computation of the GC [2]. The PPC’s for all six designs considered are displayed in Table 1. It shows that the observations made for the 8 × 2, n = 150 design holds across the five remaining designs as well: LIM is simpler than FLMP; and models assuming ordered priors are simpler than models assuming uniform priors. Note that these conclusions would not have been possible based on PC or GC. For PC, all four models have the same complexity. GC, in contrast, would detect complexity differences between LIM and FLMP (i.e., the first conclusion), but due to its insensitivity to the parameter prior, the complexity differences between LIMu and LIMo on the one hand, and FLMPu and FLMPo on the other hand (i.e., the second conclusion) would have gone unnoticed. 5 Discussion A theorist defining a model should clearly and explicitly specify at least the three following pieces of information: the model equation, the parameter prior range, and the parameter prior distribution. If any of these pieces is missing, the model should be regarded as incomplete, and therefore untestable. Consequently, any measure of generalizability should be sensitive to all three aspects of the model definition. Many currently popular generalizability measures do not satisfy this criterion, including AIC, BIC and MDL. A measure of generalizability that does take these three aspects of a model into account is the marginal likelihood [6, 7, 14, 23]. 
Often, the marginal likelihood is criticized precisely for its sensitivity to the prior range and distribution (e.g., [24]). However, in light of the fact that the prior is part of the model definition, I see the sensitivity of the marginal likelihood to the prior as an asset rather than a nuisance. It is precisely the measures of generalizability that are insensitive to the prior that miss an important aspect of the model.

Similarly, any stand-alone measure of model complexity should be sensitive to all three aspects of the model definition, as all three contribute to the model's complexity (with the model equation contributing two factors: the number of parameters and the functional form). Existing measures of complexity do not satisfy this requirement and are therefore incomplete. PC takes only part of the model equation into account, whereas GC takes only the model equation and the range into account. In contrast, the PPC proposed here is sensitive to all three aspects. It assesses model complexity using the predicted interval, which contains all possible outcomes a model can generate. A narrow predicted interval (relative to the universal interval) indicates a simple model; a complex model is characterized by a wide predicted interval.

There is a tight coupling between the notions of information, knowledge, and uncertainty on the one hand, and the notion of model complexity on the other. As parameters correspond to unknown variables, having more information available leads to fewer parameters and hence to a simpler model. Similarly, the more information there is available, the sharper the parameter prior, implying a simpler model. To put it differently, the less uncertainty present in a model, the narrower its predicted interval, and the simpler the model. For example, in model Mb there is maximal uncertainty: nothing but the range is known about θ, so all values of θ are equally likely. In contrast, in model Mf there is minimal uncertainty: ph is known for sure, so only a single value of θ is possible. This difference in uncertainty translates into a difference in complexity. The same is true for the information integration models. Incorporating the order constraints in the priors reduces the uncertainty compared to the models without these constraints (it tells you, for example, that parameter a1 is smaller than a2). This reduction in uncertainty is reflected in a smaller complexity.

There are many different sources of prior information that can be translated into a range or distribution. The illustration using the information integration models highlighted that prior information can reflect meaningful aspects of the design. Alternatively, priors can be informed by previous applications of similar models in similar settings. Probably the purest form of priors are those that translate theoretical assumptions made by a model (see [16]). The fact that it is often difficult to formalize this prior information should not be used as an excuse to leave the prior unspecified. It is certainly a challenging task, but so is translating theoretical assumptions into the model equation. Formalizing theory, intuitions, and information is what model building is all about.
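The claim that order constraints sharpen the prior can be verified directly. Under the ordered prior, a1 is the smallest of I independent uniforms (a Beta(1, I) variable), so its marginal concentrates near 0. The following sketch, again my own illustration reusing the sorted-uniform construction from the earlier snippet, compares the marginal 99% interval of a1 under the two priors:

    import numpy as np

    rng = np.random.default_rng(2)
    I = 8
    indep = rng.uniform(size=(100_000, I))  # LIMu / FLMPu style prior
    ordered = np.sort(indep, axis=1)        # LIMo / FLMPo style prior (16)

    for label, a1 in (("independent", indep[:, 0]), ("ordered", ordered[:, 0])):
        lo, hi = np.percentile(a1, [0.5, 99.5])
        print(label, round(hi - lo, 2))
    # independent: about 0.99; ordered: about 0.48, since a1 ~ Beta(1, 8)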
References

[1] Myung, I. J. (2000) The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204.
[2] Pitt, M. A., Myung, I. J., and Zhang, S. (2002) Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.
[3] Shiffrin, R. M., Lee, M. D., Kim, W., and Wagenmakers, E. J. (2008) A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248–1284.
[4] Cutting, J. E., Bruno, N., Brady, N. P., and Moore, C. (1992) Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth. Journal of Experimental Psychology: General, 121, 364–381.
[5] Dunn, J. (2000) Model complexity: The fit to random data reconsidered. Psychological Research, 63, 174–182.
[6] Myung, I. J. and Pitt, M. A. (1997) Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95.
[7] Vanpaemel, W. and Storms, G. (in press) Abstraction and model evaluation in category learning. Behavior Research Methods.
[8] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Petrov, B. and Csaki, F. (eds.), Second International Symposium on Information Theory, pp. 267–281, Akademiai Kiado.
[9] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
[10] Myung, I. J., Balasubramanian, V., and Pitt, M. A. (2000) Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97, 11170–11175.
[11] Lee, M. D. (2002) Generating additive clustering models with minimal stochastic complexity. Journal of Classification, 19, 69–85.
[12] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
[13] Grünwald, P. (2000) Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133–152.
[14] Lee, M. D. and Wagenmakers, E. J. (2005) Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.
[15] Lee, M. D. and Vanpaemel, W. (2008) Exemplars, prototypes, similarities and rules in category representation: An example of hierarchical Bayesian analysis. Cognitive Science, 32, 1403–1424.
[16] Vanpaemel, W. and Lee, M. D. (submitted) Using priors to formalize theory: Optimal attention and the generalized context model.
[17] Lee, M. D. (2008) Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15, 1–15.
[18] Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004) WinBUGS User Manual Version 2.0. Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge.
[19] Anderson, N. H. (1981) Foundations of information integration theory. Academic Press.
[20] Oden, G. C. and Massaro, D. W. (1978) Integration of featural information in speech perception. Psychological Review, 85, 172–191.
[21] Massaro, D. W. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press.
[22] Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001) Bayes factor of model selection validates FLMP. Psychonomic Bulletin & Review, 8, 1–17.
[23] Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
[24] Liu, C. C. and Aitkin, M. (2008) Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 53, 362–375.

2 0.97388512 201 nips-2009-Region-based Segmentation and Object Detection

Author: Stephen Gould, Tianshi Gao, Daphne Koller

Abstract: Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other [10, 11]. However, current state-of-the-art models use a separate representation for each task, making joint inference clumsy and leaving the classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach simultaneously reasons about pixels, regions and objects in a coherent probabilistic model. Pixel appearance features allow us to perform well on classifying amorphous background classes, while the explicit representation of regions facilitates the computation of more sophisticated features necessary for object detection. Importantly, our model gives a single unified description of the scene: we explain every pixel in the image and enforce global consistency between all random variables in our model. We run experiments on the challenging Street Scene dataset [2] and show significant improvement over state-of-the-art results for object detection accuracy. 1

3 0.96198076 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

Author: Masayuki Karasuyama, Ichiro Takeuchi

Abstract: We propose a multiple incremental decremental algorithm for the Support Vector Machine (SVM). The conventional single incremental decremental SVM can update the trained model efficiently when a single data point is added to or removed from the training set. When we add and/or remove multiple data points, this algorithm is time-consuming because it must be applied repeatedly to each data point. The proposed algorithm is computationally more efficient when multiple data points are added and/or removed simultaneously. The single incremental decremental algorithm is built on an optimization technique called parametric programming. We extend the idea and introduce multi-parametric programming for developing the proposed algorithm. Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly reduce the computational cost of multiple incremental decremental operations. Our approach is especially useful for online SVM learning in which we need to remove old data points and add new data points in a short amount of time.

same-paper 4 0.95778126 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

Author: Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, Paul L. Ruvolo

Abstract: Modern machine learning-based approaches to computer vision require very large databases of hand labeled images. Some contemporary vision systems already require on the order of millions of images for training (e.g., Omron face detector [9]). New Internet-based services allow for a large number of labelers to collaborate around the world at very low cost. However, using these services brings interesting theoretical and practical challenges: (1) The labelers may have wide ranging levels of expertise which are unknown a priori, and in some cases may be adversarial; (2) images may vary in their level of difficulty; and (3) multiple labels for the same image must be combined to provide an estimate of the actual label of the image. Probabilistic approaches provide a principled way to approach these problems. In this paper we present a probabilistic model and use it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image. On both simulated and real data, we demonstrate that the model outperforms the commonly used “Majority Vote” heuristic for inferring image labels, and is robust to both noisy and adversarial labelers. 1

5 0.9391951 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

6 0.7907387 133 nips-2009-Learning models of object structure

7 0.78871119 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

8 0.77504861 211 nips-2009-Segmenting Scenes by Matching Image Composites

9 0.73675478 25 nips-2009-Adaptive Design Optimization in Experiments with People

10 0.69795871 115 nips-2009-Individuation, Identification and Object Discovery

11 0.68623894 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

12 0.68586272 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

13 0.68339533 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

14 0.67982137 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

15 0.67793751 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships

16 0.67500013 175 nips-2009-Occlusive Components Analysis

17 0.66526687 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

18 0.65836459 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

19 0.65243798 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

20 0.65202916 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data