NIPS 2013, Paper 216: On Flat versus Hierarchical Classification in Large-Scale Taxonomies
Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini
Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data-dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation for several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies.
The target classes in such large-scale scenarios typically have an inherent hierarchical structure, usually in the form of a rooted tree, as in Directory Mozilla (www.dmoz.org), or of a directed acyclic graph, with a parent-child relationship. Various classification techniques have been proposed for deploying classifiers in such large-scale taxonomies, from flat (sometimes referred to as big-bang) approaches to fully hierarchical ones adopting a complete top-down strategy. Several attempts have also been made to develop new classification techniques that integrate, at least partly, the hierarchy into the objective function being optimized ([3, 5, 10, 11], among others). These techniques are however costly in practice, and most studies rely either on a flat classifier, or on a hierarchical one deployed on the original hierarchy or on a simplified version of it obtained by pruning some nodes (as in [15, 18]). In a top-down hierarchical classifier, intermediate decisions are taken at each internal node; this intermediate decision making leads to the error-propagation phenomenon and causes a decrease in accuracy. Flat classifiers, on the other hand, rely on a single decision involving all the final categories, a decision that is however difficult to make as it concerns many, potentially unbalanced, categories. It is thus very difficult to assess which strategy is best, and there is at present no consensus on whether a flat or a hierarchical approach should be preferred for a particular category system. In this paper, we address this problem and introduce new bounds on the generalization error of classifiers deployed in large-scale taxonomies.
These bounds make explicit the trade-off that both flat and hierarchical classifiers face in large-scale taxonomies and provide an explanation for these empirical observations. (The study in [19] introduces a slightly different simplification, through an embedding of both categories and documents into a common space.)
To our knowledge, this is the first time that such bounds are introduced and that the behavior of flat and hierarchical classifiers is explained on theoretical grounds. We also propose a well-founded way to select the nodes that should be pruned, so as to derive a taxonomy better suited to the classification problem. Contrary to [4], which reweights the edges in a taxonomy through a cost-sensitive loss function to achieve this goal, we use here a simple pruning strategy that modifies the taxonomy in an explicit way. The remainder of the paper is organized as follows: Section 2 introduces the notation used and presents the generalization error bounds for classification in large-scale taxonomies. It also presents the meta-classifier we designed to select the nodes that should be pruned in the original taxonomy. Section 3 illustrates these developments via experiments conducted on several taxonomies extracted from DMOZ and the International Patent Classification.
In the case of hierarchical classification, the hierarchy of classes H = (V, E) is defined in the form of a rooted tree, with a root ⊥ and a parent relationship π : V \ {⊥} → V, where π(v) is the parent of node v ∈ V \ {⊥}, and E denotes the set of edges, oriented from parent to child. For each node v ∈ V \ {⊥}, we further define the set of its sisters S(v) = {v′ ∈ V \ {⊥} : v′ ≠ v ∧ π(v′) = π(v)} and the set of its daughters D(v) = {v′ ∈ V \ {⊥} : π(v′) = v}. The nodes at the intermediary levels of the hierarchy define general class labels, while the specialized nodes at the leaf level, denoted Y = {y ∈ V : ∄v ∈ V, (y, v) ∈ E} ⊂ V, constitute the set of target classes.
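These definitions translate directly into code. The sketch below is illustrative only (the toy node names and the parent-map representation are ours, not the paper's): a rooted hierarchy is stored as its parent map π, from which S(v), D(v) and the leaf set Y are derived.

```python
# Illustrative sketch: a toy rooted hierarchy represented by its parent map.
ROOT = "ROOT"  # plays the role of the root symbol

pi = {
    "A": ROOT, "B": ROOT,             # internal nodes under the root
    "a1": "A", "a2": "A",             # leaves under A
    "b1": "B", "b2": "B", "b3": "B",  # leaves under B
}

def daughters(v):
    """D(v): the nodes whose parent is v."""
    return {u for u, p in pi.items() if p == v}

def sisters(v):
    """S(v): the nodes sharing v's parent, excluding v itself."""
    return {u for u, p in pi.items() if p == pi[v] and u != v}

def leaves():
    """Y: the nodes with no daughters, i.e. the target classes."""
    internal = set(pi.values())
    return {u for u in pi if u not in internal}
```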
In the case of flat classification, the hierarchy H is ignored, Y = V, and the problem reduces to the classical supervised multiclass classification problem.
2.1 A hierarchical Rademacher data-dependent bound
Our main result is the following theorem, which provides a data-dependent bound on the generalization error of a top-down multiclass hierarchical classifier. The learning problem we address is then to find a hypothesis f from F_B such that the generalization error of g_f ∈ G_{F_B}, E(g_f) = E_{(x,y)∼D}[1_{g_f(x,y)≤0}], is minimal (1_{g_f(x,y)≤0} is the 0/1 loss, equal to 1 if g_f(x, y) ≤ 0 and 0 otherwise).
The following theorem sheds light on the trade-off between flat and hierarchical classification. The notion of function class capacity used here is the empirical Rademacher complexity [1]. For flat multiclass classification, we recover the bounds of [12] by considering a hierarchy containing a root node with as many daughters as there are categories. The generalization error is controlled in inequality (1) by a trade-off between the empirical error and the Rademacher complexity of the class of classifiers. The Rademacher complexity term favors hierarchical classifiers over flat ones, as any split of a category set of size n into k parts n_1, ..., n_k (with n_1 + ... + n_k = n) is such that n_1^2 + ... + n_k^2 ≤ n^2.
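The inequality Σ_i n_i² ≤ n² holds for any partition of n into positive parts, which is easy to check numerically. A minimal sketch (the set size, number of parts, and number of trials are arbitrary choices of ours):

```python
import random

def random_split(n, k):
    """Draw a random composition n1 + ... + nk = n with every ni >= 1."""
    cuts = sorted(random.sample(range(1, n), k - 1))
    return [b - a for a, b in zip([0] + cuts, cuts + [n])]

random.seed(0)
ok = True
for k in range(2, 20):
    for _ in range(50):
        parts = random_split(100, k)
        ok = ok and sum(parts) == 100              # it is a partition of n
        ok = ok and sum(p * p for p in parts) <= 100 * 100  # sum ni^2 <= n^2
```

In fact the inequality is strict for any k ≥ 2, since n² = (Σ_i n_i)² adds the positive cross terms 2 Σ_{i<j} n_i n_j to Σ_i n_i².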
On the other hand, the empirical error term is likely to favor flat classifiers over hierarchical ones, as the latter rely on a series of decisions (as many as the length of the path from the root to the chosen category in Y) and are thus more likely to make mistakes. This fact is often referred to as the error-propagation problem in hierarchical classification. On the contrary, flat classifiers rely on a single decision and are not prone to this problem (even though the decision to be made is harder). When the classification problem in Y is highly unbalanced, the decision that a flat classifier has to make is difficult; hierarchical classifiers still have to make several decisions, but the imbalance problem is less severe for each of them. So, in this case, even though the empirical error of hierarchical classifiers may be higher than that of flat ones, the difference can be counterbalanced by the Rademacher complexity term, and the bound in Theorem 1 suggests that hierarchical classifiers should be preferred over flat ones. These results have been observed empirically in different studies on classification in large-scale taxonomies and are further discussed in Section 3. Similarly, one way to improve the accuracy of classifiers deployed in large-scale taxonomies is to modify the taxonomy by pruning (sets of) nodes [18]. By doing so, one flattens part of the taxonomy and once again trades off the two terms in inequality (1): pruning nodes reduces the number of decisions made by the hierarchical classifier while maintaining a reasonable Rademacher complexity.
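Flattening a node can be sketched as a simple operation on a parent map: the pruned node's daughters are reattached to its parent, which shortens every root-to-leaf path through that node by one decision. The representation below is an illustration of ours, not the paper's implementation.

```python
def prune(pi, v):
    """Flatten node v: reattach its daughters D(v) to its parent pi[v]
    and remove v.  `pi` maps each non-root node to its parent; a new
    map is returned and the input is left untouched."""
    parent = pi[v]
    return {u: (parent if p == v else p) for u, p in pi.items() if u != v}

# Toy hierarchy: ROOT -> {A, B}, A -> {a1, a2}, B -> {b1}.
pi = {"A": "ROOT", "a1": "A", "a2": "A", "B": "ROOT", "b1": "B"}
pruned = prune(pi, "A")  # a1 and a2 become daughters of ROOT
```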
Even though it can explain several empirical results obtained so far, the bound displayed in Theorem 1 does not provide a practical way to decide whether to prune a node, as doing so would involve training many classifiers, which is impractical with large-scale taxonomies. We thus turn, in the next section, to another bound that will help us design a direct and simple strategy to prune nodes in a taxonomy.

2.2 Asymptotic approximation error bounds
We now propose an asymptotic approximation error bound for a multiclass logistic regression (MLR) classifier. We first consider the flat, multiclass case (V = Y), and then show how the bounds can be combined in a typical top-down cascade, leading to the identification of important features that control the variation of these bounds.
, d}, y ∈ Y \ {y′}}, and using the independence assumption and the asymptotic normality of maximum-likelihood estimates (see for example [17], p.
Lemma 1 suggests that the predicted and asymptotic posterior probabilities are close to each other, as the quantities they are based on are close to each other. Thus, provided that, for any given x, the asymptotic posterior probabilities of the best two classes are not too close to each other, the generalization error of the MLR classifier and that of its asymptotic version should be similar. Theorem 2 below states such a relationship, using a function that measures the confusion between the best two classes for the asymptotic MLR classifier, defined as h∞(x) = argmax_{y∈Y} P(y|x, β∞). For any given x ∈ X, the confusion between the best two classes is defined as follows.
Definition 1. Let f¹∞(x) = max_{y∈Y} P(y|x, β∞) be the best class posterior probability for x under the asymptotic MLR classifier, and let f²∞(x) = max_{y∈Y\{h∞(x)}} P(y|x, β∞) be the second best class posterior probability for x. We define the confusion of the asymptotic MLR classifier for a category set Y as G_Y(τ) = P_{(x,y)∼D}(|f¹∞(x) − f²∞(x)| < 2τ), for a given τ > 0.
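G_Y(τ) has a direct empirical counterpart: the fraction of examples whose two largest posterior probabilities differ by less than 2τ. A minimal sketch of ours, with made-up posterior vectors:

```python
def confusion(posteriors, tau):
    """Empirical estimate of G_Y(tau): the fraction of examples whose
    two largest class posteriors are within 2*tau of each other."""
    close = 0
    for p in posteriors:
        top = sorted(p, reverse=True)
        if abs(top[0] - top[1]) < 2 * tau:
            close += 1
    return close / len(posteriors)

# Made-up posterior vectors for three examples over three classes.
P = [[0.7, 0.2, 0.1], [0.45, 0.40, 0.15], [0.34, 0.33, 0.33]]
g = confusion(P, 0.05)  # only the last two examples are "confused"
```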
The following theorem states a relationship between the generalization error of a trained MLR classifier and that of its asymptotic version. Theorem 2. Let {(x_i, y_i)}_{i=1}^m be drawn i.i.d. from a probability distribution D, let h_m and h∞ denote the multiclass logistic regression classifiers learned from a training set of finite size m and its asymptotic version respectively, and let E(h_m) and E(h∞) be their generalization errors.
(Inequality (4) bounds E(h_m) − E(h∞) in terms of the confusion G_Y and the quantities R, |Y|, σ0, δ and m; the MLR scores are linear, of the form Σ_{j=1}^d β_j^y x_j for all x ∈ X and y ∈ Y, with σ0 a constant.) Proof (sketch). The difference E(h_m) − E(h∞) is bounded by the probability that the asymptotic MLR classifier h∞ correctly classifies an example (x, y) ∈ X × Y randomly chosen from D, while h_m misclassifies it.
2.3 A learning-based node pruning strategy
Let us now consider a hierarchy of classes and a top-down classifier making decisions at each level of the hierarchy. A node-based pruning strategy can easily be derived from the approximation bounds above. Indeed, any node v in the hierarchy H = (V, E) is associated with three category sets: its sister categories together with the node itself, S̄(v) = S(v) ∪ {v}; its daughter categories D(v); and the union of its sisters and daughters, denoted F(v) = S(v) ∪ D(v). These three sets of categories are the ones involved before and after the pruning of node v.
Let us now denote by h_m^{S̄(v)} the MLR classifier learned from the set of sister categories of node v and the node itself, and by h_m^{D(v)} the MLR classifier learned from the set of daughter categories of node v (h_∞^{S̄(v)} and h_∞^{D(v)} respectively denote their asymptotic versions).
They nevertheless exhibit the factors that play an important role in assessing whether a particular trained classifier in the logistic regression family is close or not to its asymptotic version. Each node v ∈ V can then be characterized by the factors {|Y|, m_Y, n_Y, G_Y(·)} computed for the category sets Y associated with it. We estimate G_Y(·) with two simple quantities: the average cosine similarity of all the pairs of classes in Y, and the average symmetric Kullback-Leibler divergence between the class-conditional multinomial distributions of all the pairs of classes in Y.
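Both proxy quantities are straightforward to compute from the class-conditional distributions. The sketch below is ours (the function names and the smoothing constant `eps` are assumptions; the paper does not specify how zero probabilities are handled):

```python
import math
from itertools import combinations

def cosine(p, q):
    """Cosine similarity between two distributions seen as vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence; eps is an arbitrary smoothing constant."""
    kl = lambda a, b: sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)

def class_pair_features(class_dists):
    """Average pairwise cosine similarity and symmetric KL divergence
    over all pairs of class-conditional distributions."""
    pairs = list(combinations(class_dists, 2))
    avg_cos = sum(cosine(p, q) for p, q in pairs) / len(pairs)
    avg_kl = sum(sym_kl(p, q) for p, q in pairs) / len(pairs)
    return avg_cos, avg_kl
```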
We assign a positive (resp. negative) class to a node if the pruning of that node leads to a final performance increase (resp. decrease). A meta-classifier is then trained on these features using a training set built from a selected class hierarchy. After the learning phase, the meta-classifier is applied to each node of a new hierarchy of classes so as to identify which nodes should be pruned.
3 Discussion
We start our discussion by presenting results on different hierarchical datasets with different characteristics, using MLR and SVM classifiers. The datasets used in these experiments are two large datasets extracted from the International Patent Classification (IPC) collection and the publicly available DMOZ dataset from the second PASCAL Large Scale Hierarchical Text Classification challenge (LSHTC2). We created 4 datasets from LSHTC2 by randomly splitting the first-layer nodes (11 in total) of the original hierarchy into disjoint subsets. The classes in the IPC and LSHTC2 datasets are organized in a hierarchy in which documents are assigned to the leaf categories only. CR denotes the complexity ratio between hierarchical and flat classification, given by the Rademacher complexity term in Theorem 1: CR = Σ_{v∈V\Y} |D(v)|(|D(v)| − 1) / (|Y|(|Y| − 1)); the same constants B, R and L are used in the two cases.
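Given a parent map, CR can be computed in a few lines; a value below 1 indicates that the hierarchy has a smaller Rademacher complexity term than the flat classifier. A sketch of ours, on a made-up tree:

```python
from collections import Counter

def complexity_ratio(pi):
    """CR = sum over internal nodes v of |D(v)|(|D(v)| - 1), divided by
    |Y|(|Y| - 1), where Y is the set of leaves and `pi` the parent map."""
    fanout = Counter(pi.values())                 # |D(v)| for each internal node
    n_leaves = sum(1 for u in pi if u not in fanout)
    num = sum(d * (d - 1) for d in fanout.values())
    return num / (n_leaves * (n_leaves - 1))

# Made-up tree: R -> {A, B}, A -> {a1, a2}, B -> {b1, b2}.
tree = {"A": "R", "B": "R", "a1": "A", "a2": "A", "b1": "B", "b2": "B"}
cr = complexity_ratio(tree)  # 6 / 12 = 0.5: hierarchy has lower complexity
```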
On the other hand, the last column of Table 1 gives the ratio of the empirical errors obtained with top-down hierarchical classification over flat classification when using SVMs. The comparison of the complexity and error ratios on all the datasets thus suggests that the flat classification strategy may be preferred on IPC, whereas the hierarchical one is more likely to be efficient on the LSHTC datasets. To test our simple node pruning strategy, we learned binary classifiers that decide whether to prune a node, based on the node features described in the previous section. The label associated with each node in this training set is defined as +1 if pruning the node increases the accuracy of the hierarchical classifier by at least 0.1, and -1 if pruning the node decreases the accuracy by more than 0.1. The meta-classifier is then trained to learn a mapping from the vector representation of a node (based on the above features) to the labels {+1, -1}.
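The labeling rule can be sketched as follows. How nodes falling strictly between the two thresholds are handled is not stated in the text, so returning None for them (and leaving them out of the training set) is our assumption:

```python
def node_label(acc_pruned, acc_original, threshold=0.1):
    """+1 if pruning improves accuracy by at least `threshold`,
    -1 if it degrades accuracy by more than `threshold`,
    None otherwise (assumption: such nodes are skipped for training)."""
    delta = acc_pruned - acc_original
    if delta >= threshold:
        return 1
    if delta < -threshold:
        return -1
    return None
```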
We used the first two datasets of LSHTC2 to extract the training data, while LSHTC2-3, 4, 5 and IPC were employed for testing. We compare the fully flat classifier (FL) with the fully hierarchical (FH) top-down Pachinko machine, a random pruning (RN) and the proposed pruning method (PR). For the random pruning, we restrict the procedure to the first two levels and perform 4 random prunings (this is the average number of prunings performed by our approach). For each dataset we perform 5 independent runs of the random pruning and record the best performance. On all LSHTC datasets, flat classification performs worse than fully hierarchical top-down classification, for all classifiers. These results are in line with the complexity and empirical error ratios for SVM estimated on the different collections and shown in Table 1, as well as with the results obtained in [14, 7] on the same type of taxonomies. Further, the work in [14] demonstrated that class hierarchies on the LSHTC datasets suffer from the rare-categories problem, i.e., 80% of the target categories in such hierarchies have fewer than 5 documents assigned to them. As a result, flat methods on such datasets face unbalanced classification problems, which results in smaller error ratios; hierarchical classification should be preferred in this case. This is in agreement with the conclusions of recent studies, such as [2, 9, 16, 6], in which the datasets considered do not have rare categories and are better balanced.
The proposed hierarchy pruning strategy aims to adapt the given taxonomy structure for better classification while maintaining the ancestor-descendant relationship between any given pair of nodes. As shown in Table 2, this simple learning-based pruning strategy leads to statistically significantly better results for all three classifiers, compared to both the original taxonomy and a randomly pruned one. A similar result is reported in [18] through the pruning of an entire layer of the hierarchy, which can be seen as a generalization, albeit an empirical one, of the pruning strategy retained here. Another interesting approach to modifying the original taxonomy is presented in [21].
4 Conclusion
We have studied in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies, through generalization error bounds for multiclass, hierarchical classifiers. The first theorem we have introduced provides an explanation for several empirical results related to the performance of such classifiers. We have also introduced a well-founded way to simplify a taxonomy by selectively pruning some of its nodes, through a meta-classifier. The features retained in this meta-classifier derive from the generalization error bounds we have proposed. The experimental results reported here (as well as in other papers) are in line with our theoretical developments and justify the pruning strategy adopted. This is the first time, to our knowledge, that a data-dependent generalization error bound has been proposed for multiclass, hierarchical classifiers and that a theoretical explanation has been provided for the performance of flat and hierarchical classification strategies in large-scale taxonomies. In particular, there has been, up to now, no consensus on which classification scheme, flat or hierarchical, to use on a particular category system. One of our main conclusions is that top-down hierarchical classifiers are well suited to unbalanced, large-scale taxonomies, whereas flat ones should be preferred for well-balanced taxonomies. Lastly, our theoretical development also suggests possibilities to grow a hierarchy of classes from a (large) set of categories, as has been done in several studies.
wordName wordTfidf (topN-words)
[('mlr', 0.421), ('classi', 0.317), ('hierarchical', 0.291), ('pruning', 0.231), ('ipc', 0.219), ('hierarchy', 0.195), ('taxonomies', 0.179), ('ers', 0.178), ('gf', 0.172), ('er', 0.17), ('node', 0.151), ('categories', 0.15), ('taxonomy', 0.141), ('multiclass', 0.126), ('rademacher', 0.122), ('gfb', 0.113), ('asymptotic', 0.105), ('pruned', 0.096), ('hdv', 0.09), ('prune', 0.089), ('hm', 0.089), ('svm', 0.078), ('gy', 0.069), ('classes', 0.068), ('daughters', 0.068), ('lshtc', 0.068), ('sister', 0.068), ('nodes', 0.066), ('fb', 0.065), ('generalization', 0.065), ('cation', 0.065), ('daughter', 0.06), ('preferred', 0.059), ('category', 0.057), ('deployed', 0.055), ('strategy', 0.054), ('confusion', 0.053), ('decisions', 0.053), ('unbalanced', 0.052), ('datasets', 0.046), ('explanation', 0.046), ('attening', 0.045), ('dmoz', 0.045), ('hfv', 0.045), ('hmv', 0.045), ('pachinko', 0.045), ('patent', 0.045), ('pds', 0.045), ('prunings', 0.045), ('vky', 0.045), ('children', 0.044), ('decision', 0.044), ('hierarchies', 0.043), ('error', 0.04), ('flat', 0.04), ('gopal', 0.04), ('grenoble', 0.04), ('vl', 0.04), ('wv', 0.04), ('fisher', 0.039), ('posterior', 0.039), ('bounds', 0.038), ('class', 0.037), ('training', 0.036), ('fh', 0.034), ('bound', 0.034), ('trained', 0.033), ('parent', 0.033), ('rare', 0.033), ('theorem', 0.033), ('studies', 0.033), ('liblinear', 0.033), ('ratios', 0.032), ('reported', 0.032), ('developments', 0.032), ('empirical', 0.031), ('fl', 0.031), ('balanced', 0.031), ('ratio', 0.031), ('ones', 0.03), ('gs', 0.03), ('maxy', 0.03), ('trees', 0.03), ('complexity', 0.03), ('root', 0.029), ('retained', 0.029), ('cr', 0.029), ('adaboost', 0.029), ('consensus', 0.028), ('rely', 0.028), ('xj', 0.028), ('sigir', 0.027), ('largescale', 0.027), ('embedding', 0.027), ('documents', 0.026), ('collecting', 0.026), ('gd', 0.026), ('rooted', 0.026), ('categorization', 0.025), ('md', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini
Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1
2 0.17530511 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
3 0.16282134 174 nips-2013-Lexical and Hierarchical Topic Regression
Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik
Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. 
Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ݐௗ௦ ݖௗ௦ ݓௗ௦ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ݕௗ ܦ ߱ ߟ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. 
Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. 
In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 
1. We emphasize that, unlike in the HDP, where each table is assigned a single dish, each table in our metaphor is associated with a combo, a collection of L dishes. We also use combo and path interchangeably.

Figure 2: SHLDA's restaurant franchise metaphor. [Diagram not recoverable from the extraction; caption retained.]

Table 1: Notation used in this paper.
- $S_d$: number of sentences in document d
- $S_{d,t}$: number of groups (i.e., sentences) sitting at table t in restaurant d
- $N_{d,s}$: number of tokens in $w_{d,s}$
- $N_{d,\cdot,l}$: number of tokens in $w_d$ assigned to level l
- $N_{d,\cdot,>l}$: number of tokens in $w_d$ assigned to levels greater than l
- $N_{d,\cdot,\geq l} \equiv N_{d,\cdot,l} + N_{d,\cdot,>l}$
- $M_{c,l}$: number of tables at level l on path c
- $C_{c,l,v}$: count of word type v assigned to level l on path c
- $C_{d,x,l,v}$: count of word type v in $v_{d,x}$ assigned to level l
- $\phi_k$: topic at node k
- $\eta_k$: regression parameter at node k
- $\tau_v$: regression parameter of word type v
- $c_{d,t}$: path assignment for table t in restaurant d
- $t_{d,s}$: table assignment for group $w_{d,s}$
- $z_{d,s,n}$: level assignment for $w_{d,s,n}$
- $k_{d,s,n}$: node assignment for $w_{d,s,n}$ (i.e., the node at level $z_{d,s,n}$ on path $c_{d,t_{d,s}}$)
- $L$: height of the tree
- $C^+$: set of all possible paths (including new ones) of the tree

Unlike SLDA, we model the response variables using a normal linear regression that contains both per-topic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics' regression parameters in SLDA: each topic k (here, a tree node) has a parameter $\eta_k$, and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in SHLDA makes it possible to discover relationships between topics and the response variable that SLDA's simple latent space obscures. Consider, for example, a topic model trained on Congressional debates.
Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. SHLDA could do that and also capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological viewpoints, along with characterizing how the higher-level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response-variable continuum regardless of the underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary by $\tau_v$, and use the normalized frequency of v in the document as its regressor. Including both hierarchical and lexical parameters is important. For detecting ideology in the US, "liberty" is an effective indicator of conservative speakers regardless of context; however, "cost" is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, "wonderful" is globally a positive word; however, "unexpected" is a positive descriptor of books but a negative one of a car's steering. SHLDA captures these properties in a single model.

3 Posterior Inference and Optimization

Given documents with observed words $w = \{w_{d,s,n}\}$ and response variables $y = \{y_d\}$, the inference task is to find the posterior distribution over: the tree structure, including topic $\phi_k$ and regression parameter $\eta_k$ for each node k; the combo assignment $c_{d,t}$ for each table t in document d; the table assignment $t_{d,s}$ for each sentence s in document d; and the level assignment $z_{d,s,n}$ for each token $w_{d,s,n}$.
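As a concrete reading of the regression component described in Section 2, the mean of a document's response combines the empirical distribution over the tree nodes that generated it with its normalized word frequencies. The dense NumPy vectors over K+ nodes and V word types are an assumed layout; this is a sketch, not the authors' code.

```python
import numpy as np

def response_mean(node_counts, word_counts, eta, tau):
    """Mean of a document's response: eta . zbar + tau . wbar, where zbar is
    the empirical distribution over the tree nodes that generated the
    document's tokens and wbar holds the document's normalized word
    frequencies (the regressors for the lexical parameters tau)."""
    n_tokens = node_counts.sum()   # total tokens N_d; equals word_counts.sum()
    zbar = node_counts / n_tokens  # per-node empirical distribution
    wbar = word_counts / n_tokens  # normalized word frequencies
    return float(eta @ zbar + tau @ wbar)
```

The observed response is then drawn from a normal distribution centered at this value, so documents with the same topic usage but different word choices can still receive different predicted responses through the lexical term.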
We approximate SHLDA's posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, we define the following probabilities. For more thorough derivations, please see the supplement.

• First, define $v_{d,x}$ as a set of tokens (e.g., a token, a sentence, or a set of sentences) in document d. The conditional density of $v_{d,x}$ being assigned to path c given all other assignments is

$$f_c^{-d,x}(v_{d,x}) = \prod_{l=1}^{L} \left[ \frac{\Gamma(C_{c,l,\cdot}^{-d,x} + V\beta_l)}{\Gamma(C_{c,l,\cdot}^{-d,x} + C_{d,x,l,\cdot} + V\beta_l)} \prod_{v=1}^{V} \frac{\Gamma(C_{c,l,v}^{-d,x} + C_{d,x,l,v} + \beta_l)}{\Gamma(C_{c,l,v}^{-d,x} + \beta_l)} \right] \quad (1)$$

where the superscript $-d,x$ denotes the same count excluding the assignments of $v_{d,x}$, and marginal counts are represented by dots. For a new path $c^{new}$, if a node does not exist yet, $C_{c^{new},l,v}^{-d,x} = 0$ for all word types v.

• Second, define the conditional density of the response variable $y_d$ of document d, given that $v_{d,x}$ is assigned to path c and all other assignments, as

$$g_c^{-d,x}(y_d) = \mathcal{N}\!\left( \frac{1}{N_{d,\cdot}} \left[ \sum_{l=1}^{L} \eta_{c,l}\, C_{d,x,l,\cdot} + \sum_{w_{d,s,n} \in \{w_d \setminus v_{d,x}\}} \eta_{c_{d,t_{d,s}},\, z_{d,s,n}} + \sum_{s=1}^{S_d} \sum_{n=1}^{N_{d,s}} \tau_{w_{d,s,n}} \right],\; \rho \right) \quad (2)$$

where $N_{d,\cdot}$ is the total number of tokens in document d. For a new node at level l on a new path $c^{new}$, we integrate over all possible values of $\eta_{c^{new},l}$.

Sampling t: For each group $w_{d,s}$ we need to sample a table $t_{d,s}$. The conditional distribution of a table t given $w_{d,s}$ and the other assignments is proportional to the number of sentences sitting at t times the probability of $w_{d,s}$ and $y_d$ being observed under this assignment.
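Equation 1 is a standard collapsed Dirichlet-multinomial predictive, a ratio of Gamma functions, and is most stably evaluated in log space. The sketch below assumes L x V lists of counts and is an illustration, not the authors' implementation.

```python
from math import lgamma, log

def log_path_density(C_path, C_new, beta):
    """Log of the collapsed Dirichlet-multinomial predictive in Eq. (1):
    the probability of adding the token counts C_new[l][v] to a path whose
    existing counts (with the current assignments excluded) are C_path[l][v].
    beta[l] is the per-level smoothing parameter."""
    log_p = 0.0
    for l, (row, new) in enumerate(zip(C_path, C_new)):
        V = len(row)
        tot, tot_new = sum(row), sum(new)
        # normalizer for the level: Gamma(C + V*beta) / Gamma(C + C_new + V*beta)
        log_p += lgamma(tot + V * beta[l]) - lgamma(tot + tot_new + V * beta[l])
        # per-word-type terms
        for c, n in zip(row, new):
            log_p += lgamma(c + n + beta[l]) - lgamma(c + beta[l])
    return log_p
```

For a single new token the expression collapses to the familiar smoothed predictive $(C_{c,l,v} + \beta_l)/(C_{c,l,\cdot} + V\beta_l)$, which is a useful sanity check.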
This is

$$P(t_{d,s} = t \mid rest) \propto P(t_{d,s} = t \mid \mathbf{t}_d^{-s}) \cdot P(w_{d,s}, y_d \mid t_{d,s} = t, \mathbf{w}^{-d,s}, \mathbf{t}^{-d,s}, \mathbf{z}, \mathbf{c}, \eta)$$
$$\propto \begin{cases} S_{d,t}^{-d,s} \cdot f_{c_{d,t}}^{-d,s}(w_{d,s}) \cdot g_{c_{d,t}}^{-d,s}(y_d), & \text{for an existing table } t; \\ \alpha \cdot \sum_{c \in C^+} P(c_{d,t^{new}} = c \mid \mathbf{c}^{-d,s}) \cdot f_c^{-d,s}(w_{d,s}) \cdot g_c^{-d,s}(y_d), & \text{for a new table } t^{new}. \end{cases} \quad (3)$$

For a new table $t^{new}$, we need to sum over all possible paths $C^+$ of the tree, including new ones. For example, the set $C^+$ for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). The prior probability of path c is

$$P(c_{d,t^{new}} = c \mid \mathbf{c}^{-d,s}) \propto \begin{cases} \displaystyle\prod_{l=2}^{L} \frac{M_{c,l}^{-d,s}}{M_{c,l-1}^{-d,s} + \gamma_{l-1}}, & \text{for an existing path } c; \\[1ex] \displaystyle\frac{\gamma_{l^*}}{M_{c^{new},l^*}^{-d,s} + \gamma_{l^*}} \prod_{l=2}^{l^*} \frac{M_{c^{new},l}^{-d,s}}{M_{c^{new},l-1}^{-d,s} + \gamma_{l-1}}, & \text{for a new path } c^{new}, \text{ consisting of an existing path from the root to a node at level } l^* \text{ and a new node.} \end{cases} \quad (4)$$

Sampling z: After assigning a sentence $w_{d,s}$ to a table, we assign each token $w_{d,s,n}$ to a level to choose a dish from the combo. The probability of assigning $w_{d,s,n}$ to level l is

$$P(z_{d,s,n} = l \mid rest) \propto P(z_{d,s,n} = l \mid \mathbf{z}_d^{-s,n}) \, P(w_{d,s,n}, y_d \mid z_{d,s,n} = l, \mathbf{w}^{-d,s,n}, \mathbf{z}^{-d,s,n}, \mathbf{t}, \mathbf{c}, \eta). \quad (5)$$

The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to

$$P(z_{d,s,n} = l \mid \mathbf{z}_d^{-s,n}) = \frac{m\pi + N_{d,\cdot,l}^{-d,s,n}}{\pi + N_{d,\cdot,\geq l}^{-d,s,n}} \prod_{j=1}^{l-1} \frac{(1-m)\pi + N_{d,\cdot,>j}^{-d,s,n}}{\pi + N_{d,\cdot,\geq j}^{-d,s,n}}.$$

The second factor is the probability of observing $w_{d,s,n}$ and $y_d$ given that $w_{d,s,n}$ is assigned to level l: $P(w_{d,s,n}, y_d \mid z_{d,s,n} = l, \mathbf{w}^{-d,s,n}, \mathbf{z}^{-d,s,n}, \mathbf{t}, \mathbf{c}, \eta) = f_{c_{d,t_{d,s}}}^{-d,s,n}(w_{d,s,n}) \cdot g_{c_{d,t_{d,s}}}^{-d,s,n}(y_d)$.

Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence.
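The first factor of Equation 5 is a truncated stick-breaking distribution over levels. It can be sketched as below; the 0-indexed count arrays, with the current token excluded, are an assumed layout.

```python
def level_prior(n_at, n_gt, n_geq, m, pi):
    """First factor of Eq. (5): stick-breaking prior over the L levels of a
    combo. n_at[l], n_gt[l], n_geq[l] are the document's token counts at
    level l, at levels deeper than l, and at level l or deeper, excluding
    the current token. m and pi are the GEM mean and concentration."""
    probs = []
    for l in range(len(n_at)):
        # probability of stopping at level l ...
        p = (m * pi + n_at[l]) / (pi + n_geq[l])
        # ... times the probability of not stopping at any earlier level
        for j in range(l):
            p *= ((1 - m) * pi + n_gt[j]) / (pi + n_geq[j])
        probs.append(p)
    return probs
```

With no other tokens in the document, the prior reduces to the GEM stick-breaking weights m, m(1-m), m(1-m)^2, and so on; observed counts then pull mass toward the levels the document already uses.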
The probability of assigning table t in restaurant d to path c is

$$P(c_{d,t} = c \mid rest) \propto P(c_{d,t} = c \mid \mathbf{c}^{-d,t}) \cdot P(w_{d,t}, y_d \mid c_{d,t} = c, \mathbf{w}^{-d,t}, \mathbf{c}^{-d,t}, \mathbf{t}, \mathbf{z}, \eta) \quad (6)$$

where we slightly abuse notation by using $w_{d,t} \equiv \cup_{\{s \mid t_{d,s} = t\}} w_{d,s}$ to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables' path assignments $\mathbf{c}^{-d,t}$, excluding table t in restaurant d, and is given in Equation 4. The second factor in Equation 6 is the probability of observing $w_{d,t}$ and $y_d$ given the new path assignments: $P(w_{d,t}, y_d \mid c_{d,t} = c, \mathbf{w}^{-d,t}, \mathbf{c}^{-d,t}, \mathbf{t}, \mathbf{z}, \eta) = f_c^{-d,t}(w_{d,t}) \cdot g_c^{-d,t}(y_d)$.

Optimizing η and τ: We optimize the regression parameters η and τ via the likelihood

$$\mathcal{L}(\eta, \tau) = -\frac{1}{2\rho} \sum_{d=1}^{D} (y_d - \eta^T \bar{z}_d - \tau^T \bar{w}_d)^2 - \frac{1}{2\sigma} \sum_{k=1}^{K^+} (\eta_k - \mu)^2 - \frac{1}{\omega} \sum_{v=1}^{V} |\tau_v|, \quad (7)$$

where $K^+$ is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13].

4 Data: Congress, Products, Films

We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3

• U.S. Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% "Yea" votes or less than 20% "Nay" votes. Each document d is a turn (a continuous utterance by a single speaker, i.e.,
speech segment [14]), and its response variable $y_d$ is the first dimension of the speaker's DW-NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6

• Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable $y_d$.7

• Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable $y_d$, although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words.

5 Evaluating Prediction

SHLDA's response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines:

• Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based and lexically based. Topic-based MLR, denoted MLR-LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically based MLR, denoted MLR-VOC, uses the frequencies of words in the vocabulary as features. MLR-LDA-VOC uses both feature types.

• Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM-LDA), word frequencies (SVM-VOC), and both (SVM-LDA-VOC) as features.8

• Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling.
The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N(0, 1) over the regression parameters to perform L2-norm regularization.9

Footnotes:
2. The superscript + denotes that this number is unbounded and varies during the sampling process.
3. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson's χ²-test to filter out those with a χ² value less than 5, which corresponds to a significance level of 0.025. We then treat the selected bigrams as single word types and add them to the vocabulary.
4. http://www.govtrack.us/data/us/109/
5. Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm
6. Data will be available after blind review.
7. The ratings can range from 1 to 5, but skew positive.
8. http://svmlight.joachims.org/
9. This performs better than unregularized SLDA in our experiments.

For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize
the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA, the variance of the regression is set to 0.5. For SHLDA, we use trees with a maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1.

Table 2: Regression results for Pearson's correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. Numeric suffixes denote the number of topics for parametric models. For SVM-LDA-VOC and MLR-LDA-VOC, only the best results across K ∈ {10, 30, 50} are reported. Column groups: Floor Debates House-Senate (HS) and Senate-House (SH), Amazon Reviews (Am), Movie Reviews (Mo).

Model          HS-PCC  HS-MSE  SH-PCC  SH-MSE  Am-PCC  Am-MSE  Mo-PCC  Mo-MSE
SVM-LDA-10     0.173   0.861   0.08    1.247   0.157   1.241   0.327   0.970
SVM-LDA-30     0.172   0.840   0.155   1.183   0.277   1.091   0.365   0.938
SVM-LDA-50     0.169   0.832   0.215   1.135   0.245   1.130   0.395   0.906
SVM-VOC        0.336   1.549   0.131   1.467   0.373   0.972   0.584   0.681
SVM-LDA-VOC    0.256   0.784   0.246   1.101   0.371   0.965   0.585   0.678
MLR-LDA-10     0.163   0.735   0.068   1.151   0.143   1.034   0.328   0.957
MLR-LDA-30     0.160   0.737   0.162   1.125   0.258   1.065   0.367   0.936
MLR-LDA-50     0.150   0.741   0.248   1.081   0.234   1.114   0.389   0.914
MLR-VOC        0.322   0.889   0.191   1.124   0.408   0.869   0.568   0.721
MLR-LDA-VOC    0.319   0.873   0.194   1.120   0.410   0.860   0.581   0.702
SLDA-10        0.154   0.729   0.090   1.145   0.270   1.113   0.383   0.953
SLDA-30        0.174   0.793   0.128   1.188   0.357   1.146   0.433   0.852
SLDA-50        0.254   0.897   0.245   1.184   0.241   1.939   0.503   0.772
SHLDA          0.356   0.753   0.303   1.076   0.413   0.891   0.597   0.673
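With the hyperparameter settings above (µ = 0, σ = ω = ρ = 0.5), the penalized likelihood of Equation 7 that the M-step maximizes can be evaluated directly. This is a sketch under assumed dense matrix layouts, not the authors' implementation; note that the non-smooth |τ| term means a plain L-BFGS needs an L1-aware variant (e.g., OWL-QN-style handling) in practice.

```python
import numpy as np

def shlda_objective(eta, tau, y, zbar, wbar,
                    rho=0.5, sigma=0.5, mu=0.0, omega=0.5):
    """Penalized likelihood of Eq. (7). zbar is a D x K+ matrix of
    per-document empirical node distributions; wbar is a D x V matrix of
    normalized word frequencies; y holds the D responses."""
    resid = y - zbar @ eta - wbar @ tau              # y_d - eta^T zbar_d - tau^T wbar_d
    return (-(resid ** 2).sum() / (2.0 * rho)        # Gaussian response term
            - ((eta - mu) ** 2).sum() / (2.0 * sigma)  # Gaussian prior on eta
            - np.abs(tau).sum() / omega)             # Laplace (L1) prior on tau
```

The L1 penalty on τ encourages most lexical weights to be exactly zero, so only strongly polarized words (Table 3) receive large coefficients.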
We also set the regression parameter of the root node to zero, which speeds inference (since the root is associated with every document) and is reasonable because the root should not change the response variable. To compare the performance of the different methods, we compute Pearson's correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables, averaged over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate, and vice versa.

Results and analysis. Table 2 shows the performance of all models on our three datasets. Methods that use only topic-based features, such as SVM-LDA and MLR-LDA, do poorly. Methods based only on lexical features, like SVM-VOC and MLR-VOC, significantly outperform the topic-only methods on the two review datasets, but are comparable or worse on the Congressional debates. This suggests that reviews contain more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in SHLDA. In all cases, SHLDA achieves strong results. For the two cases where SHLDA was second best in MSE (Amazon reviews and House-Senate), it outperforms the other methods in PCC. Doing well in PCC on these two datasets matters because achieving low MSE is relatively easy given the response variables' bimodal distribution in the floor debates and positively skewed distribution in the Amazon reviews. For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House.
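The two evaluation metrics reported in Table 2 are straightforward to compute; a minimal sketch:

```python
import numpy as np

def pcc_mse(y_true, y_pred):
    """Pearson correlation coefficient (PCC, higher is better) and mean
    squared error (MSE, lower is better) between true and predicted
    responses, the two metrics reported in Table 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pcc = np.corrcoef(y_true, y_pred)[0, 1]
    mse = ((y_true - y_pred) ** 2).mean()
    return pcc, mse
```

Reporting both is informative because a constant predictor near the response mean can achieve low MSE on a skewed or bimodal response distribution while having no correlation with the truth.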
6 Qualitative Analysis: Agendas and Framing/Perspective

Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. Figure 3 shows a portion of the topic hierarchy induced from the Congressional debate corpus. Nodes A and E illustrate agendas (issues introduced into political discourse) associated with a particular ideology: Node A focuses on the hardships of the poorer victims of hurricane Katrina and is associated with Democrats, while the text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral "tax" topic, reveal how the parties frame taxes: as gains in terms of new social services (Democrats) and as losses for job creators (Republicans).

Figure 3: Topics discovered from Congressional floor debates.
Many first-level topics are bipartisan (purple), while lower-level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the "tax" topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes ("children", "education", "health care"), while its Republican-leaning child (C) focuses on business implications ("death tax", "jobs", "businesses"). Colors and the numbers beneath each topic show the learned regression parameter η associated with that topic.

Figure 4 shows the topic structure discovered by SHLDA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.10 These nodes have general topics with no specific polarity. However, the bottom level clearly exhibits polarized positive/negative perspectives. For example, Node A concerns washbasins for infants and has two polarized child nodes: reviewers take a positive perspective when their children enjoy the product (Node B: "loves", "splash", "play") but react negatively when it leaks (Node C: "leak(s/ed/ing)").
Figure 4: Topics discovered from Amazon reviews. Higher topics are general, while lower topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the first-level topics have no specific polarity and are associated with a broad class of products, such as "routers" (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of "router" focuses on upgradable firmware such as "tomato" and "ddwrt" (Node E, positive), while another focuses on poor "tech support" and "customer service" (Node F, negative). The number below each topic is the regression parameter learned for that topic.

In addition to the per-topic regression parameters, SHLDA also associates each word with a lexical regression parameter τ. Table 3 shows the top ten words with the highest and lowest τ. The results are unsurprising, although the lexical regression for the Congressional debates is less clear-cut than for the other datasets. As we saw in Section 5, for similar datasets, SHLDA's context-specific regression is more useful when global lexical weights do not readily differentiate documents.

10. All of the nodes at the second level have slightly negative values for the regression parameters, mainly due to the very skewed distribution of the review ratings in Amazon.
Table 3: Top words based on the global lexical regression coefficient τ. For the floor debates, positive τ's are Republican-leaning while negative τ's are Democrat-leaning.

Floor Debates
  Top words with positive weights: bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien
  Top words with negative weights: bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine
Amazon Reviews
  Top words with positive weights: highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy
  Top words with negative weights: waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible
Movie Reviews
  Top words with positive weights: hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability
  Top words with negative weights: bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring

7 Related Work

SHLDA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these "upstream" models [25], Perotte et al. [26] propose a "downstream" model called HSLDA, which jointly models documents' hierarchy of labels and topics. HSLDA's topic structure is flat, however, and its response variable is a hierarchy of labels associated with each document, unlike SHLDA's continuous response variable.
Finally, another related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words.

8 Conclusion

We have introduced SHLDA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis. Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which concerns the issues that get discussed, and framing, which concerns authors' perspectives on those issues. We plan to analyze the topic structure produced by SHLDA with political science collaborators and, more generally, to study how SHLDA and related models can help analyze and discover useful insights from political discourse.

Acknowledgments

This research was supported in part by NSF grants #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion.
North, 2009(05-12):21, 2002.
[2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing Public Life. 2001.
[3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The Decline of the Death Penalty and the Discovery of Innocence. Cambridge University Press, 2008.
[4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.
[6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012.
[7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010.
[8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012.
[10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013.
[11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012.
[12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007.
[13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989.
[14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006.
[15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004.
[16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008.
[17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005.
[18] Joachims, T. Making large-scale SVM learning practical. In Advances in Kernel Methods - SVM. 1999.
[19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008.
[21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009.
[22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011.
[23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012.
[24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008.
[26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011.
[27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010.
[28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011.
[29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011.
[30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008.
[31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008.
[32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009.
[33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. 2010.
[34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013.
4 0.15443617 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: High capacity classifiers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classification performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classifier’s parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set. 1
5 0.10866652 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang
Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′th order bottom sample margin. 1
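The first phase the abstract describes — greedy coordinate descent on the empirical 0-1 loss until a coordinatewise local minimum — can be illustrated roughly as follows. This sketch adds a unit step along one weak classifier per iteration and is a drastic simplification of DirectBoost (no line search, no margin-maximization phase); names are illustrative.

```python
import numpy as np

def greedy_01_boost(H, y, max_iters=10):
    """Greedily grow an ensemble by adding, at each step, the weak classifier
    whose inclusion most reduces training 0-1 error.
    H: (n_samples, n_weak) matrix of +/-1 weak predictions; y: +/-1 labels."""
    n, m = H.shape
    alpha = np.zeros(m)          # ensemble coefficients
    score = np.zeros(n)          # current ensemble scores
    err = np.mean(np.sign(score) != y)
    for _ in range(max_iters):
        best_j, best_err = None, err
        for j in range(m):
            e = np.mean(np.sign(score + H[:, j]) != y)
            if e < best_err:
                best_j, best_err = j, e
        if best_j is None:
            break                # coordinatewise local minimum reached
        alpha[best_j] += 1.0
        score += H[:, best_j]
        err = best_err
    return alpha, err
```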
6 0.1085894 171 nips-2013-Learning with Noisy Labels
7 0.10739779 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
8 0.10691684 335 nips-2013-Transfer Learning in a Transductive Setting
9 0.10379659 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
10 0.1004365 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
11 0.09962976 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
12 0.094978385 60 nips-2013-Buy-in-Bulk Active Learning
13 0.093033671 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
14 0.085707821 289 nips-2013-Scalable kernels for graphs with continuous attributes
15 0.083554119 211 nips-2013-Non-Linear Domain Adaptation with Boosting
16 0.0799236 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
17 0.078479126 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
18 0.077557907 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
19 0.074277781 318 nips-2013-Structured Learning via Logistic Regression
topicId topicWeight
[(0, 0.189), (1, 0.062), (2, -0.099), (3, -0.07), (4, 0.174), (5, -0.016), (6, 0.019), (7, -0.01), (8, -0.049), (9, 0.073), (10, -0.035), (11, -0.036), (12, 0.057), (13, -0.059), (14, 0.037), (15, -0.118), (16, 0.077), (17, 0.122), (18, 0.06), (19, -0.049), (20, 0.082), (21, 0.048), (22, 0.043), (23, 0.014), (24, -0.004), (25, 0.049), (26, 0.059), (27, -0.057), (28, 0.051), (29, -0.106), (30, -0.01), (31, 0.033), (32, -0.09), (33, -0.13), (34, 0.019), (35, 0.005), (36, 0.079), (37, -0.064), (38, -0.047), (39, -0.014), (40, -0.006), (41, 0.068), (42, 0.017), (43, 0.067), (44, -0.028), (45, 0.118), (46, 0.026), (47, -0.073), (48, 0.026), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.96928722 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini
Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1
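The flat-versus-hierarchical distinction the abstract studies can be made concrete: a hierarchical (top-down) classifier routes an example from the root through local decisions at each node, whereas a flat classifier takes an argmax over all leaf categories at once. The sketch below is purely illustrative — the scorers and taxonomy are toy stand-ins, not the paper's models or bounds.

```python
def predict_top_down(x, children, node_scorers, root):
    """Route an example down a rooted taxonomy: at each internal node,
    follow the child whose local scorer rates x highest, until a leaf."""
    node = root
    while children.get(node):
        node = max(children[node], key=lambda c: node_scorers[c](x))
    return node

def predict_flat(x, leaves, node_scorers):
    """Flat alternative: a single argmax over all leaf-category scorers."""
    return max(leaves, key=lambda c: node_scorers[c](x))
```

Pruning (flattening) an internal node, as the paper's meta-classifier decides to do, amounts to attaching that node's descendants directly to its parent, trading more children per decision against fewer cascaded decisions.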
2 0.75357455 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: High capacity classifiers, such as deep neural networks, often struggle on classes that have very few training examples. We propose a method for improving classification performance for such classes by discovering similar classes and transferring knowledge among them. Our method learns to organize the classes into a tree hierarchy. This tree structure imposes a prior over the classifier’s parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set. 1
3 0.75353128 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
Author: Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, Antonio Criminisi
Abstract: Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization. 1
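Evaluating a decision DAG differs from evaluating a tree only in that several parents may route into the same child node, which is what keeps the model compact. A minimal sketch of inference over such a DAG (the node encoding is mine; training via joint split/merge optimization is not shown):

```python
def evaluate_dag(x, nodes, root=0):
    """Evaluate a rooted decision DAG. Unlike a tree, multiple parents may
    share a child, so node count need not grow exponentially with depth.
    nodes: id -> ('split', feature, threshold, left_id, right_id)
                 or ('leaf', label)."""
    node = nodes[root]
    while node[0] == 'split':
        _, f, t, left, right = node
        node = nodes[left] if x[f] <= t else nodes[right]
    return node[1]
```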
4 0.69164437 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
5 0.66334963 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang
Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′th order bottom sample margin. 1
6 0.65616906 335 nips-2013-Transfer Learning in a Transductive Setting
7 0.65110171 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
9 0.59096581 171 nips-2013-Learning with Noisy Labels
10 0.58426648 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
11 0.58402872 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
12 0.56886476 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
13 0.55007488 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
14 0.54825526 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
15 0.53920561 174 nips-2013-Lexical and Hierarchical Topic Regression
16 0.51892787 226 nips-2013-One-shot learning by inverting a compositional causal process
17 0.51271164 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
18 0.50476152 47 nips-2013-Bayesian Hierarchical Community Discovery
19 0.50417995 289 nips-2013-Scalable kernels for graphs with continuous attributes
20 0.49733093 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
topicId topicWeight
[(2, 0.012), (13, 0.222), (16, 0.019), (33, 0.202), (34, 0.097), (41, 0.022), (49, 0.022), (56, 0.112), (70, 0.059), (85, 0.056), (89, 0.03), (93, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.83768266 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini
Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1
2 0.82981527 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh
Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1
3 0.81738126 294 nips-2013-Similarity Component Analysis
Author: Soravit Changpinyo, Kuan Liu, Fei Sha
Abstract: Measuring similarity is crucial to many learning tasks. To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. 1
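The combination step the SCA abstract describes — each latent component scoring the pair with its own metric, fused through a (noisy-)OR gate — can be sketched directly. The diagonal-metric and sigmoid choices below are illustrative simplifications, not the paper's exact parameterization.

```python
import numpy as np

def sca_similarity(x, y, metrics, bias=0.0):
    """Combine per-component local similarities with a noisy-OR gate.
    Each component k compares x and y under its own diagonal metric M,
    so the pair is 'similar' if ANY component finds it similar."""
    local_sims = []
    for M in metrics:                       # M: per-feature weights
        d = np.sum(M * (x - y) ** 2)        # component-specific distance
        local_sims.append(1.0 / (1.0 + np.exp(d - bias)))  # sigmoid(bias - d)
    local_sims = np.array(local_sims)
    return 1.0 - np.prod(1.0 - local_sims)  # noisy-OR combination
```

Because of the OR semantics, components can specialize on disjoint feature subsets: a pair judged dissimilar by most components still scores highly if one component strongly matches it.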
4 0.80054742 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions
Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes
Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1
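For a monotone submodular function f with f(∅) = 0, the total curvature the abstract refers to is commonly defined as κ = 1 − min_j [f(V) − f(V∖{j})] / f({j}): κ = 0 for modular (linear) functions and κ → 1 as marginal gains saturate. A brute-force sketch of that definition (assuming f is cheap to evaluate on small ground sets):

```python
def curvature(f, V):
    """Total curvature of a monotone submodular f on ground set V (a frozenset):
    kappa = 1 - min_j [f(V) - f(V \ {j})] / f({j}).
    Assumes f({j}) > 0 for every singleton."""
    fV = f(V)
    return 1.0 - min((fV - f(V - {j})) / f({j}) for j in V)
```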
5 0.76322222 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
Author: Leonidas Lefakis, François Fleuret
Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several compute-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1
6 0.76135784 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
7 0.76095867 331 nips-2013-Top-Down Regularization of Deep Belief Networks
8 0.76086593 251 nips-2013-Predicting Parameters in Deep Learning
10 0.75900149 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
11 0.75877166 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
12 0.75859088 335 nips-2013-Transfer Learning in a Transductive Setting
13 0.75803155 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
14 0.75750822 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
15 0.75714386 30 nips-2013-Adaptive dropout for training deep neural networks
16 0.75658727 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
17 0.75422609 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
18 0.75357425 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
19 0.75340492 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
20 0.75210804 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions