nips nips2013 nips2013-10 knowledge-graph by maker-knowledge-mining

10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification


Source: pdf

Author: George H. Chen, Stanislav Nikolov, Devavrat Shah

Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. [sent-7, score-0.98]

2 To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. [sent-10, score-0.658]

3 We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. [sent-11, score-1.309]

4 Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. [sent-12, score-1.071]

5 We then use weighted majority voting to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. [sent-13, score-0.92]

6 1 Introduction. Recent years have seen an explosion in the availability of time series data related to virtually every human endeavor — data that demands to be analyzed and turned into valuable insights. [sent-14, score-0.324]

7 As a running example used throughout this paper, consider a time series that tracks how much activity there is for a particular news topic on Twitter. [sent-16, score-0.599]

8 Given this time series up to present time, we ask “will this news topic go viral?” [sent-17, score-0.566]

9 Borrowing Twitter’s terminology, we label the time series a “trend” and call its corresponding news topic a trending topic if the news topic goes viral; otherwise, the time series has label “not trend”. [sent-18, score-1.705]

10 We seek to forecast whether a news topic will become a trend before it is declared a trend (or not) by Twitter, amounting to a binary classification problem. [sent-19, score-0.688]

11 Importantly, we skirt the discussion of what makes a topic considered trending as this is irrelevant to our mathematical development. [sent-20, score-0.333]

12 1 Furthermore, we remark that handling the case where a single time series can have different labels at different times is beyond the scope of this paper. [sent-21, score-0.347]

13 1 While it is not public knowledge how Twitter defines a topic to be a trending topic, Twitter does provide information for which topics are trending topics. [sent-22, score-0.664]

14 More recently, researchers have examined which distance to use with nearest-neighbor classification [2, 7, 18] or how to boost classification performance by applying different transformations to the time series before using nearest-neighbor classification [1]. [sent-25, score-0.324]

15 These existing results are mostly experimental, lacking theoretical justification for both when nearest-neighbor-like time series classifiers should be expected to perform well and how well. [sent-26, score-0.346]

16 However, rather than examining the asymptotic case where the amount of data goes to infinity, we instead pursue nonasymptotic performance guarantees in terms of how large of a training dataset we have and how much we observe of the time series to be classified. [sent-28, score-0.478]

17 We present a model for which nearest-neighbor-like classification performs well by operationalizing the following hypothesis: In many time series applications, there are only a small number of prototypical time series relative to the number of time series we can collect. [sent-31, score-0.995]

18 In this context, we present a novel latent source model: time series are generated from a small collection of m unknown latent sources, each having one of two labels, say “trend” or “not trend”. [sent-34, score-0.842]

19 Our model’s maximum a posteriori (MAP) time series classifier can be approximated by weighted majority voting, which compares the time series to be classified with each of the time series in the labeled training data. [sent-35, score-1.501]

20 Each training time series casts a weighted vote in favor of its ground truth label, with the weight depending on how similar the time series being classified is to the training example. [sent-36, score-1.069]

21 The voting is nonparametric in that it does not learn parameters for a model and is driven entirely by the training data. [sent-38, score-0.592]

22 The unknown latent sources are never estimated; the training data serve as a proxy for these latent sources. [sent-39, score-0.747]

23 Weighted majority voting itself can be approximated by a nearest-neighbor classifier, which we also analyze. [sent-40, score-0.764]

24 Under our model, we show sufficient conditions so that if we have n = Θ(m log(m/δ)) time series in our training data, then weighted majority voting and nearest-neighbor classification correctly classify a new time series with probability at least 1 − δ after observing its first Ω(log(m/δ)) time steps. [sent-41, score-1.897]

25 Also, while our analysis yields matching error upper bounds for the two classifiers, experimental results on synthetic data suggest that weighted majority voting outperforms nearest-neighbor classification early on when we observe very little of the time series to be classified. [sent-43, score-1.302]

26 Meanwhile, a specific instantiation of our model leads to a spherical Gaussian mixture model, where the latent sources are Gaussian mixture components. [sent-44, score-0.609]

27 We show that existing performance guarantees on learning spherical Gaussian mixture models [6, 10, 17] require more stringent conditions than what our results need, suggesting that learning the latent sources is overkill if the goal is classification. [sent-45, score-0.563]

28 Lastly, we apply weighted majority voting to forecasting trending topics on Twitter. [sent-46, score-1.316]

29 We emphasize that our goal is precognition of trends: predicting whether a topic is going to be a trend before it is actually declared to be a trend by Twitter or, in theory, any other third party that we can collect ground truth labels from. [sent-47, score-0.534]

30 Existing work that identifies trends on Twitter [3, 4, 13] instead, as part of its trend detection, defines models for what trends are, which we do not do, nor do we assume we have access to such definitions. [sent-48, score-0.344]

31 In our experiments, weighted majority voting is able to predict whether a topic will be a trend in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. [sent-50, score-1.401]

32 We empirically find that the Twitter activity of a news topic that becomes a trend tends to follow one of a finite number of patterns, which could be thought of as latent sources. [sent-51, score-0.682]

33 Weighted majority voting and nearest-neighbor classification for time series are presented in Section 2. [sent-53, score-1.088]

34 We provide our latent source model and theoretical performance guarantees of weighted majority voting and nearest-neighbor classification under this model in Section 3. [sent-54, score-1.272]

35 Experimental results for synthetic data and forecasting trending topics on Twitter are in Section 4. [sent-55, score-0.418]

36 To do so, we have access to labeled training data R+ and R−, which denote the sets of all training time series with labels +1 and −1 respectively. [sent-58, score-0.587]

37 Each positively-labeled example r ∈ R+ casts a weighted vote e^{−γ d^{(T)}(r,s)} for whether time series s has label +1, where d^{(T)}(r, s) is some measure of similarity between the two time series r and s, and the superscript (T) indicates that we are only allowed to look at the first T time steps (i.e., [sent-60, score-1.129]

38 time steps 1, . . . , T) of s (but we’re allowed to look outside of these time steps for the training time series r), and the constant γ ≥ 0 is a scaling parameter that determines the “sphere of influence” of each example. [sent-65, score-0.531]

39 Similarly, each negatively-labeled example in R− also casts a weighted vote for whether time series s has label −1. [sent-66, score-0.687]

40 However, this similarity measure only looks at the first T time steps of training time series r. [sent-68, score-0.509]

41 Since time series in our training data are known, we need not restrict our attention to their first T time steps. [sent-69, score-0.482]

42 d^{(T)}(r, s) = min_{Δ ∈ {−Δ_max, . . . , Δ_max}} ‖(r ∗ Δ) − s‖²_T, (1) where we minimize over integer time shifts Δ with a pre-specified maximum allowed shift Δ_max ≥ 0. [sent-82, score-0.329]
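
As a concrete illustration, the shift-minimized distance in (1) can be computed as in the following Python sketch; the function name, the zero-based array indexing, and the choice to simply skip shifts that would reach outside the recorded training series are our own assumptions rather than the authors' implementation.

    import numpy as np

    def shifted_distance(r, s, T, max_shift):
        """Squared Euclidean distance between the first T steps of s and the best
        time-shifted alignment of training series r, cf. equation (1)."""
        s_head = np.asarray(s, dtype=float)[:T]
        r = np.asarray(r, dtype=float)
        best = np.inf
        for shift in range(-max_shift, max_shift + 1):
            # Advancing r by `shift` steps means comparing s(t) with r(t + shift).
            if shift < 0 or shift + T > len(r):
                continue  # this shift would need samples of r we do not have recorded
            best = min(best, float(np.sum((r[shift:shift + T] - s_head) ** 2)))
        return best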

43 Here, we have used q ∗ Δ to denote time series q advanced by Δ time steps, i.e., (q ∗ Δ)(t) = q(t + Δ). [sent-83, score-0.393]

44 Finally, we sum up all of the weighted +1 votes and then all of the weighted −1 votes. [sent-86, score-0.351]

45 The label with the majority of overall weighted votes is declared as the label for s: L̂^{(T)}(s; γ) = +1 if Σ_{r ∈ R+} e^{−γ d^{(T)}(r,s)} ≥ Σ_{r ∈ R−} e^{−γ d^{(T)}(r,s)}, and −1 otherwise. (2) [sent-87, score-0.739]
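
A minimal sketch of rule (2) is given below, reusing the shifted_distance helper sketched earlier; the list-of-arrays data layout is assumed, and the optional threshold theta anticipates the generalized rule (8) discussed later (theta = 1 gives ordinary weighted majority voting).

    import numpy as np

    def weighted_majority_vote(s, R_plus, R_minus, T, gamma, max_shift, theta=1.0):
        """Classify time series s as +1 or -1 by weighted majority voting (2)."""
        vote_plus = sum(np.exp(-gamma * shifted_distance(r, s, T, max_shift))
                        for r in R_plus)
        vote_minus = sum(np.exp(-gamma * shifted_distance(r, s, T, max_shift))
                         for r in R_minus)
        # Declare +1 when the weighted +1 votes beat theta times the weighted -1 votes.
        return +1 if vote_plus >= theta * vote_minus else -1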

46 3 A Latent Source Model and Theoretical Guarantees. We assume there to be m unknown latent sources (time series) that generate observed time series. [sent-94, score-0.489]

47 Let V denote the set of all such latent sources; each latent source is a function v : Z → R. [sent-95, score-0.518]

48 Let V+ ⊂ V be the set of latent sources with label +1, and V− ⊂ V be the set of those with label −1. [sent-97, score-0.66]

49 The observed time series are generated from latent sources as follows. [sent-98, score-0.744]

50 We index time using Z for notational convenience but will assume time series to start at time step 1. [sent-101, score-0.462]

51 The only change is that the amount of training data needed will be larger by a factor of 1/(m π_min), where π_min is the smallest probability of a particular latent source occurring. [sent-103, score-0.397]

52 Figure 1: Example of latent sources superimposed, where each latent source is shifted vertically in amplitude such that every other latent source has label +1 and the rest have label −1. [sent-104, score-1.405]

53 The observed time series S : Z → R is taken to be latent source V advanced by Δ time steps, followed by adding a noise signal E : Z → R. [sent-112, score-0.377]

54 The label associated with the generated time series S is the same as that of V. [sent-116, score-0.444]
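
The generative process can be sketched as follows; the uniform choice of latent source and shift, the Gaussian noise, and the edge padding used to realize the shift are illustrative assumptions rather than the paper's exact specification.

    import numpy as np

    def sample_from_latent_source_model(sources, labels, length, max_shift,
                                        noise_std, rng):
        """Draw one labeled time series: pick a latent source, advance it by a
        random shift, and add i.i.d. Gaussian noise."""
        k = int(rng.integers(len(sources)))
        shift = int(rng.integers(-max_shift, max_shift + 1))
        padded = np.pad(np.asarray(sources[k], dtype=float), max_shift, mode="edge")
        start = max_shift + shift  # index of time step 1 after advancing by `shift`
        series = padded[start:start + length] + rng.normal(0.0, noise_std, size=length)
        return series, labels[k]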

55 For instance, the latent sources could be tiled as shown in Figure 1, where they are evenly separated vertically and alternate between the two different classes +1 and −1. [sent-126, score-0.49]

56 With a parametric model like a k-component Gaussian mixture model, estimating these latent sources could be problematic. [sent-127, score-0.51]

57 For example, if we take any two adjacent latent sources with label +1 and cluster them, then this cluster could be confused with the latent source having label −1 that is sandwiched in between. [sent-128, score-0.99]

58 In this example, the k-component Gaussian mixture model needed for label +1 would require k to be the exact number of latent sources with label +1, which is unknown. [sent-130, score-0.728]

59 As we discuss next, for classification, we sidestep learning the latent sources altogether, instead using training data as a proxy for latent sources. [sent-132, score-0.747]

60 If we knew the latent sources and if noise entries E(t) were i.i.d. [sent-135, score-0.42]

61 zero-mean Gaussian across t, then the maximum a posteriori (MAP) estimate for label L given an observed time series S = s is L̂^{(T)}_MAP(s; Δ_max) = +1 if Λ^{(T)}_MAP(s; Δ_max) ≥ 1, and −1 otherwise, (5) where Λ^{(T)}_MAP(s; Δ_max) is the corresponding likelihood ratio of label +1 to label −1, computed over the allowed time shifts. [sent-138, score-0.444]
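
Under this Gaussian-noise reading, and assuming a uniform prior over latent sources and over shifts, the oracle MAP rule (5) reduces to a vote cast by the true latent sources; the sketch below reuses shifted_distance, with gamma standing in for the inverse noise variance (an assumed parametrization, not the paper's exact constants).

    import numpy as np

    def oracle_map_classify(s, sources_plus, sources_minus, T, gamma, max_shift):
        """Oracle version of rule (5): the true latent sources, rather than
        training examples, cast the weighted votes."""
        num = sum(np.exp(-gamma * shifted_distance(v, s, T, max_shift))
                  for v in sources_plus)
        den = sum(np.exp(-gamma * shifted_distance(v, s, T, max_shift))
                  for v in sources_minus)
        return +1 if num >= den else -1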

62 We make a further assumption that the training data were sampled from the latent source model and that we have n different training time series. [sent-147, score-0.555]

63 Then we approximate the MAP classifier by using training data as a proxy for the latent sources. [sent-155, score-0.327]

64 Λ^{(T)}(s; γ) = [ Σ_{r ∈ R+} exp(−γ min_{Δ ∈ D+} ‖(r ∗ Δ) − s‖²_T) ] / [ Σ_{r ∈ R−} exp(−γ min_{Δ ∈ D−} ‖(r ∗ Δ) − s‖²_T) ]. (7) Plugging Λ^{(T)} in place of Λ^{(T)}_MAP in classification rule (5) yields the weighted majority voting rule (2). [sent-157, score-1.019]

65 Note that weighted majority voting could be interpreted as a smoothed nearest-neighbor approximation whereby we only consider the time-shifted version of each example time series that is closest to the observed time series s. [sent-158, score-1.59]
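
In the limit of a very large scaling parameter, only the single closest training series matters; a sketch of that nearest-neighbor classifier, again built on the shifted_distance helper, is shown below.

    def nearest_neighbor_classify(s, R_plus, R_minus, T, max_shift):
        """1-NN rule: return the label of the training series whose best
        time-shifted alignment to the first T steps of s is closest."""
        best_label, best_dist = +1, float("inf")
        for label, examples in ((+1, R_plus), (-1, R_minus)):
            for r in examples:
                d = shifted_distance(r, s, T, max_shift)
                if d < best_dist:
                    best_dist, best_label = d, label
        return best_label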

66 The resulting decision rule, which we refer to as generalized weighted majority voting, is thus: L̂^{(T)}_θ(s; γ) = +1 if Λ^{(T)}(s; γ) ≥ θ, and −1 otherwise, (8) where setting θ = 1 recovers the usual weighted majority voting (2). [sent-162, score-1.382]

67 G^{(T)}(R+, R−, Δ_max) = min over r+ ∈ R+, r− ∈ R−, and allowed shifts Δ of ‖(r+ ∗ Δ) − r−‖²_T. (9) This quantity measures how far apart the two different classes are if we only look at length-T chunks of each time series and allow all shifts of at most Δ_max time steps in either direction. [sent-168, score-0.52]
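
The gap can be estimated directly from the training data, for instance with the sketch below; treating the allowed shift range as the same max_shift parameter used elsewhere is a simplifying assumption.

    def training_gap(R_plus, R_minus, T, max_shift):
        """Smallest shift-minimized squared distance, over the first T steps,
        between any positively and any negatively labeled training series."""
        return min(shifted_distance(r_pos, r_neg, T, max_shift)
                   for r_pos in R_plus for r_neg in R_minus)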

68 (Performance guarantee for generalized weighted majority voting) Let m+ = |V+| be the number of latent sources with label +1, and m− = |V−| = m − m+ be the number of latent sources with label −1. [sent-172, score-1.519]

69 Then with an appropriate choice of γ (a constant multiple of 1/σ², where σ² is the noise variance), generalized weighted majority voting (8) correctly classifies a new time series S with probability at least 1 − δ if G^{(T)}(R+, R−, Δ_max) = Ω( σ² ( log(θ m− + m+/θ) + log(2Δ_max + 1) + log(1/δ) ) ). [sent-180, score-1.488]

70 Thus, the gap between sets R+ and R− needs to grow logarithmically in the number of latent sources m in order for weighted majority voting to classify correctly with high probability. [sent-181, score-1.542]

71 We use a minimum rather than a summation over time shifts to make the method more similar to existing time series classification work. [sent-182, score-0.439]

72 Our generalized weighted majority voting bound (10) with θ = 1 (corresponding to regular weighted majority voting) and the same choice of γ matches our nearest-neighbor classification bound, suggesting that the two methods have similar behavior when the gap grows with T. [sent-189, score-1.431]

73 In practice, we find weighted majority voting to outperform nearest-neighbor classification when T is small, and then as T grows large, the two methods exhibit similar performance in agreement with our theoretical analysis. [sent-190, score-0.972]

74 Weighted majority voting, on the other hand, can recover from this situation as there may be enough correctly labeled training time series close by that contribute to a higher overall vote for the correct class. [sent-192, score-0.776]

75 This robustness of weighted majority voting makes it favorable in the online setting where we want to make a prediction as early as possible. [sent-193, score-0.956]

76 If we can estimate the latent sources accurately, then we could plug these estimates in place of the true latent sources in the MAP classifier and achieve classification performance close to optimal. [sent-195, score-0.862]

77 If we restrict the noise to be Gaussian and assume Δ_max = 0, then the latent source model corresponds to a spherical Gaussian mixture model. [sent-196, score-0.483]

78 In contrast, our result is in terms of the gap G^{(T)}(R+, R−, Δ_max), which depends not on the true separation between two latent sources but instead on the minimum observed separation in the training data between two time series of opposite labels. [sent-202, score-0.975]

79 In such scenarios, one could instead non-uniformly subsample O(T m³/ε²) time series from the training data using the procedure given in [8] and then feed the resulting smaller dataset, referred to as an (m, ε)-coreset, to the EM algorithm for learning the latent sources. [sent-211, score-0.645]

80 This procedure still requires more training time series than needed for classification and lacks a guarantee that the estimated latent sources will be close to the true latent sources. [sent-212, score-1.043]

[Figure 2 legend: weighted majority voting, nearest-neighbor classifier, oracle MAP classifier.]

83 Figure 3: How news topics become trends on Twitter. [sent-231, score-0.398]

84 The top left shows some time series of activity leading up to a news topic becoming trending. [sent-232, score-0.599]

85 These time series superimposed look like clutter, but we can separate them into different clusters, as shown in the next five plots. [sent-233, score-0.353]

86 We generate m = 200 latent sources, where each latent source is constructed by first sampling i.i.d. values. [sent-236, score-0.518]

87 Half of the latent sources are labeled +1 and the other half −1. [sent-240, score-0.443]

88 Then n = βm log m training time series are sampled as per the latent source model, where the noise added is i.i.d. Gaussian. [sent-241, score-0.759]

89 For β = 8, we compare the classification error rates on test data for weighted majority voting, nearest-neighbor classification, and the MAP classifier with oracle access to the true latent sources, as shown in Figure 2(a). [sent-247, score-0.902]
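
A compact driver for this kind of synthetic comparison is sketched below, reusing the helpers defined earlier; the source length, noise level, gamma, and number of test series are illustrative choices and not the paper's exact protocol (smaller sizes make for a quicker run).

    import numpy as np

    rng = np.random.default_rng(0)
    m, length, max_shift, noise_std, gamma, beta = 200, 200, 5, 1.0, 1.0, 8

    # m latent sources with i.i.d. Gaussian entries; half labeled +1, half -1.
    sources = [rng.normal(size=length) for _ in range(m)]
    labels = [+1] * (m // 2) + [-1] * (m // 2)

    # n = beta * m * log(m) labeled training series drawn from the model.
    n = int(beta * m * np.log(m))
    train = [sample_from_latent_source_model(sources, labels, length, max_shift,
                                             noise_std, rng) for _ in range(n)]
    R_plus = [x for x, y in train if y == +1]
    R_minus = [x for x, y in train if y == -1]

    # Misclassification rate of weighted majority voting for a few values of T.
    for T in (10, 25, 50):
        test = [sample_from_latent_source_model(sources, labels, length, max_shift,
                                                noise_std, rng) for _ in range(50)]
        err = np.mean([weighted_majority_vote(s, R_plus, R_minus, T, gamma, max_shift) != y
                       for s, y in test])
        print(T, err)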

90 We see that weighted majority voting outperforms nearest-neighbor classification but as T grows large, the two methods’ performances converge to that of the MAP classifier. [sent-248, score-0.95]

91 We see that as β increases, both weighted majority voting and nearest-neighbor classification steadily improve in performance. [sent-250, score-0.92]

92 (a) Weighted majority voting achieves a low error rate (FPR of 4%, TPR of 95%) and detects trending topics in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes. [sent-255, score-1.16]

93 In practice, one could still expressly assemble the training data to have pre-specified class sizes and then tune θ for generalized weighted majority voting (8). [sent-261, score-1.053]

94 In our experiments, we use the usual weighted majority voting (2) (i.e., θ = 1). [sent-262, score-0.92]

95 Per topic, we created its time series based on a pre-processed version of the raw rate of how often the topic was shared. [sent-265, score-0.475]
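
The exact pre-processing used for the Twitter experiments is not reproduced in these excerpts; as a hedged stand-in, one simple possibility is moving-average smoothing of the raw sharing-rate counts followed by a log transform, as sketched below (the window and offset are assumed parameters).

    import numpy as np

    def preprocess_rate(raw_counts, window=10, eps=1.0):
        """Illustrative pre-processing of a topic's raw sharing-rate counts:
        smooth with a moving average, then compress dynamic range with a log."""
        kernel = np.ones(window) / window
        smoothed = np.convolve(np.asarray(raw_counts, dtype=float), kernel, mode="same")
        return np.log(smoothed + eps)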

96 We empirically found that how news topics become trends tends to follow a finite number of patterns; a few examples of these patterns are shown in Figure 3. [sent-268, score-0.329]

97 We applied weighted majority voting, sweeping over γ, T, and data pre-processing parameters. [sent-270, score-0.417]

98 As shown in Figure 4(a), one choice of parameters allows us to detect trending topics in advance of Twitter 79% of the time, and when we do, we detect them an average of 1 hour and 26 minutes earlier. [sent-271, score-0.407]

99 Emerging topic detection on twitter based on temporal and social terms evaluation. [sent-297, score-0.533]

100 Querying and mining of time series data: experimental comparison of representations and distance measures. [sent-308, score-0.351]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('voting', 0.503), ('twitter', 0.372), ('majority', 0.261), ('series', 0.255), ('trending', 0.215), ('latent', 0.21), ('sources', 0.21), ('trend', 0.175), ('classi', 0.167), ('weighted', 0.156), ('news', 0.124), ('label', 0.12), ('topic', 0.118), ('topics', 0.116), ('source', 0.098), ('training', 0.089), ('fpr', 0.083), ('tpr', 0.083), ('time', 0.069), ('classify', 0.068), ('mixture', 0.068), ('forecasting', 0.065), ('trends', 0.065), ('kr', 0.059), ('er', 0.058), ('classifier', 0.057), ('cation', 0.056), ('eamonn', 0.056), ('max', 0.054), ('spherical', 0.053), ('nearest', 0.05), ('shifts', 0.046), ('casts', 0.046), ('detection', 0.043), ('declared', 0.043), ('nonasymptotic', 0.043), ('viral', 0.043), ('gap', 0.042), ('map', 0.042), ('vote', 0.041), ('kv', 0.041), ('access', 0.039), ('votes', 0.039), ('neighbor', 0.039), ('correctly', 0.038), ('log', 0.038), ('phrases', 0.038), ('manners', 0.037), ('xiaoyue', 0.037), ('early', 0.036), ('rate', 0.033), ('activity', 0.033), ('shiva', 0.033), ('advance', 0.032), ('arindam', 0.03), ('devavrat', 0.03), ('misclassifying', 0.03), ('prem', 0.03), ('schulman', 0.03), ('grows', 0.03), ('false', 0.029), ('forecast', 0.029), ('superimposed', 0.029), ('proxy', 0.028), ('observing', 0.027), ('kasiviswanathan', 0.027), ('prasad', 0.027), ('declare', 0.027), ('vertically', 0.027), ('mining', 0.027), ('steps', 0.027), ('dasgupta', 0.027), ('gaussian', 0.027), ('oracle', 0.026), ('posts', 0.026), ('hour', 0.025), ('shift', 0.025), ('management', 0.024), ('become', 0.024), ('vempala', 0.024), ('elaborate', 0.023), ('labeled', 0.023), ('decision', 0.023), ('prototypical', 0.023), ('classification', 0.023), ('separation', 0.023), ('labels', 0.023), ('generalized', 0.022), ('detect', 0.022), ('synthetic', 0.022), ('aggressive', 0.022), ('theoretical', 0.022), ('guarantees', 0.022), ('replace', 0.022), ('could', 0.022), ('allowed', 0.022), ('separated', 0.021), ('mixtures', 0.021), ('longer', 0.021), ('rule', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

Author: George H. Chen, Stanislav Nikolov, Devavrat Shah

Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1

2 0.1323366 75 nips-2013-Convex Two-Layer Modeling

Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans

Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1

3 0.09444163 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan

Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1

4 0.093071572 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

Author: Anima Anandkumar, Daniel Hsu, Majid Janzamin, Sham M. Kakade

Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of “higher order” expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition. Keywords: Overcomplete representation, admixture models, generic identifiability, tensor decomposition.

5 0.093033671 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1

6 0.092122033 301 nips-2013-Sparse Additive Text Models with Low Rank Background

7 0.090308003 171 nips-2013-Learning with Noisy Labels

8 0.089303702 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks

9 0.075820625 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

10 0.074436173 98 nips-2013-Documents as multiple overlapping windows into grids of counts

11 0.074158154 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests

12 0.074134216 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

13 0.070239328 211 nips-2013-Non-Linear Domain Adaptation with Boosting

14 0.069955625 148 nips-2013-Latent Maximum Margin Clustering

15 0.069573633 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

16 0.069321543 70 nips-2013-Contrastive Learning Using Spectral Methods

17 0.066530727 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

18 0.06426432 344 nips-2013-Using multiple samples to learn mixture models

19 0.062777683 174 nips-2013-Lexical and Hierarchical Topic Regression

20 0.062014941 149 nips-2013-Latent Structured Active Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, 0.071), (2, -0.048), (3, -0.028), (4, 0.09), (5, -0.018), (6, 0.055), (7, 0.011), (8, 0.041), (9, 0.025), (10, -0.021), (11, 0.004), (12, -0.008), (13, -0.006), (14, 0.044), (15, -0.088), (16, 0.093), (17, 0.081), (18, 0.093), (19, -0.044), (20, -0.038), (21, 0.026), (22, 0.119), (23, 0.021), (24, 0.034), (25, -0.085), (26, 0.008), (27, -0.048), (28, -0.125), (29, -0.076), (30, 0.013), (31, -0.014), (32, 0.007), (33, -0.072), (34, 0.031), (35, -0.022), (36, 0.035), (37, -0.068), (38, -0.018), (39, 0.054), (40, -0.046), (41, 0.062), (42, 0.033), (43, -0.026), (44, -0.033), (45, 0.056), (46, 0.082), (47, 0.056), (48, -0.071), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9597373 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

Author: George H. Chen, Stanislav Nikolov, Devavrat Shah

Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1

2 0.79667157 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan

Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1

3 0.64673781 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1

4 0.64210308 148 nips-2013-Latent Maximum Margin Clustering

Author: Guang-Tong Zhou, Tian Lan, Arash Vahdat, Greg Mori

Abstract: We present a maximum margin framework that clusters data using latent variables. Using latent representations enables our framework to model unobserved information embedded in the data. We implement our idea by large margin learning, and develop an alternating descent algorithm to effectively solve the resultant non-convex optimization problem. We instantiate our latent maximum margin clustering framework with tag-based video clustering tasks, where each video is represented by a latent tag model describing the presence or absence of video tags. Experimental results obtained on three standard datasets show that the proposed method outperforms non-latent maximum margin clustering as well as conventional clustering approaches. 1

5 0.63366085 75 nips-2013-Convex Two-Layer Modeling

Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans

Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1

6 0.60760063 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

7 0.60394502 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

8 0.59562075 98 nips-2013-Documents as multiple overlapping windows into grids of counts

9 0.5861001 174 nips-2013-Lexical and Hierarchical Topic Regression

10 0.58464789 301 nips-2013-Sparse Additive Text Models with Low Rank Background

11 0.5834893 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

12 0.5806514 70 nips-2013-Contrastive Learning Using Spectral Methods

13 0.57044357 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

14 0.5589062 294 nips-2013-Similarity Component Analysis

15 0.55845535 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests

16 0.55251843 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

17 0.54586643 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

18 0.54143995 85 nips-2013-Deep content-based music recommendation

19 0.52996808 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

20 0.52077836 65 nips-2013-Compressive Feature Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.011), (16, 0.023), (33, 0.126), (34, 0.083), (41, 0.023), (49, 0.036), (56, 0.094), (70, 0.025), (79, 0.013), (85, 0.027), (89, 0.402), (93, 0.044), (95, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88240546 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

Author: George H. Chen, Stanislav Nikolov, Devavrat Shah

Abstract: For classifying time series, a nearest-neighbor approach is widely used in practice with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there aren’t actually that many prototypical time series to begin with, relative to the number of time series we have access to, e.g., topics become trends on Twitter only in a few distinct manners whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a “weighted majority voting” classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees of both weighted majority voting and nearest-neighbor classification under our model accounting for how much of the time series we observe and the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority to forecast which news topics on Twitter become trends, where we are able to detect such “trending topics” in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. 1

2 0.8816641 305 nips-2013-Spectral methods for neural characterization using generalized quadratic models

Author: Il M. Park, Evan W. Archer, Nicholas Priebe, Jonathan W. Pillow

Abstract: We describe a set of fast, tractable methods for characterizing neural responses to high-dimensional sensory stimuli using a model we refer to as the generalized quadratic model (GQM). The GQM consists of a low-rank quadratic function followed by a point nonlinearity and exponential-family noise. The quadratic function characterizes the neuron’s stimulus selectivity in terms of a set linear receptive fields followed by a quadratic combination rule, and the invertible nonlinearity maps this output to the desired response range. Special cases of the GQM include the 2nd-order Volterra model [1, 2] and the elliptical Linear-Nonlinear-Poisson model [3]. Here we show that for “canonical form” GQMs, spectral decomposition of the first two response-weighted moments yields approximate maximumlikelihood estimators via a quantity called the expected log-likelihood. The resulting theory generalizes moment-based estimators such as the spike-triggered covariance, and, in the Gaussian noise case, provides closed-form estimators under a large class of non-Gaussian stimulus distributions. We show that these estimators are fast and provide highly accurate estimates with far lower computational cost than full maximum likelihood. Moreover, the GQM provides a natural framework for combining multi-dimensional stimulus sensitivity and spike-history dependencies within a single model. We show applications to both analog and spiking data using intracellular recordings of V1 membrane potential and extracellular recordings of retinal spike trains. 1

3 0.86869293 109 nips-2013-Estimating LASSO Risk and Noise Level

Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari

Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the 1 -penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the 2 estimation risk θ − θ0 2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1

4 0.84991992 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1

5 0.78188747 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification

Author: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

Abstract: As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the stateof-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a smaller computational learning cost. Our hybrid architecture allows us to assess how the performance of a conventional hand-crafted image classification pipeline changes with increased depth. We also show that convolutional networks and Fisher vector encodings are complementary in the sense that their combination further improves the accuracy. 1

6 0.75894374 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

7 0.67461377 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

8 0.63154858 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

9 0.62410736 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

10 0.61024207 237 nips-2013-Optimal integration of visual speed across different spatiotemporal frequency channels

11 0.60419828 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

12 0.59053683 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures

13 0.58535463 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

14 0.58339435 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

15 0.57996982 147 nips-2013-Lasso Screening Rules via Dual Polytope Projection

16 0.5731802 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration

17 0.57269818 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC

18 0.57201731 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

19 0.56785929 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

20 0.56323469 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization