nips nips2009 nips2009-249 knowledge-graph by maker-knowledge-mining

249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale


Source: pdf

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We formulate and address the problem of discovering dynamic malicious regions on the Internet. [sent-7, score-0.414]

2 We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. [sent-8, score-0.66]

3 We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. [sent-9, score-0.27]

4 1 Introduction It is widely acknowledged that identifying the regions that originate malicious traffic on the Internet is vital to network security and management, e. [sent-11, score-0.309]

5 We develop new algorithms able to address these difficulties that combine the underlying approach of [11] with the sleeping experts framework of [4, 10] and the online paging problem of [20]. [sent-16, score-0.496]

6 We show how to deal with a number of practical issues that arise and demonstrate empirically on real-world datasets that this method substantially improves over existing approaches of /24 prefixes and network-aware clusters [6,19,24] in correctly identifying malicious traffic. [sent-17, score-0.323]

7 Our experiments on data sets of 126 million IP addresses demonstrate that our algorithm is able to achieve a clustering that is both highly accurate and meaningful. [sent-18, score-0.17]

8 1 Background Multiple measurement studies have indicated that malicious traffic tends to cluster in a way that aligns with the structure of the IP address space, and that this is true for many different kinds of malicious traffic – spam, scanning, botnets, and phishing [6, 18, 19, 24]. [sent-20, score-0.638]

9 Such clustered behaviour can be easily explained: most malicious traffic originates from hosts in poorly-managed networks, and networks are typically assigned contiguous blocks of the IP address space. [sent-21, score-0.382]

10 Thus, it is natural that malicious traffic is clustered in parts of the IP address space that belong to poorly-managed networks. [sent-22, score-0.35]

11 From a machine learning perspective, the problem of identifying regions of malicious activity can be viewed as one of finding a good pruning of a known decision tree – the IP address space may be naturally interpreted as a binary tree (see Fig. [sent-23, score-0.815]

12 1(a)), and the goal is to learn a pruning of this tree that is not too large and has low error in classifying IP addresses as malicious or non-malicious. [sent-24, score-0.585]

13 The structure of the IP address space suggests that there may well be a pruning with only a modest number of leaves that can classify most of the traffic accurately. [sent-25, score-0.301]

14 Thus, identifying regions of malicious activity from an online stream of labeled data is much like the problem considered by Helmbold and Schapire [11] of adaptively learning a good pruning of a known decision tree. [sent-26, score-0.483]

15 One major challenge in our application comes from the scale of the data and size of a complete decision tree over the IP address space. [sent-28, score-0.273]

16 A full decision tree over the IPv4 address space would have 2^32 leaves, and over the IPv6 address space (which is slowly being rolled out), 2^128 leaves. [sent-29, score-0.358]

17 A second challenge comes from the fact that the regions of malicious activity may shift longitudinally over time [25]. [sent-32, score-0.323]

18 Such dynamic behaviour is a primary reason why individual IP addresses tend to be such poor indicators of future malicious traffic [15, 26]. [sent-36, score-0.408]

19 Thus, we cannot assume that the data comes from a fixed distribution over the IP address space; the algorithm needs to adapt to the dynamic nature of the malicious activity, and track these changes accurately and quickly. [sent-37, score-0.434]

20 While there have been a number of measurement studies [6,18, 19,24] that have examined the origin of malicious traffic from IP address blocks that are kept fixed a priori, none of these have focused on developing online algorithms that find the best predictive IP address tree. [sent-39, score-0.49]

21 Our challenge is to develop an efficient high-accuracy online algorithm that handles the severe space constraints inherent in this problem and accounts for the dynamically changing nature of malicious behavior. [sent-40, score-0.36]

22 2 Contributions In this paper, we formulate and address the problem of discovering and tracking malicious regions of the IP address space from an online stream of data. [sent-43, score-0.509]

23 We present an algorithm that adaptively prunes the IP address tree in a way that maintains at most m leaves and performs nearly as well as the optimum adaptive pruning of the IP address tree with a comparable size. [sent-44, score-0.713]

24 Our theoretical results prove that our algorithm can predict nearly as well as the best adaptive decision tree with k leaves when using O(k log k) leaves. [sent-46, score-0.364]

25 Our experimental results demonstrate that our algorithm identifies malicious regions of the IP address space accurately, with orders of magnitude improvement over previous approaches. [sent-47, score-0.392]

26 Our experiments focus on classifying spammers and legitimate senders on two mail data sets, one with 126 million messages collected over 38 days from the mail servers of a tier-1 ISP, and a second with 28 million messages collected over 6 months from an enterprise mail server. [sent-48, score-1.032]

27 Our experiments also highlight the importance of allowing the IP address tree to be dynamic, and the resulting view of the IP address space that we get is both compelling and meaningful. [sent-49, score-0.325]

28 Fig. 1: the leaves of the tree correspond to individual IP addresses, and the non-leaf nodes correspond to the remaining IP prefixes. [sent-52, score-0.298]

29 We define an IPtree T_P to be a pruning of the full IP address tree: a tree whose nodes are IP prefixes P ∈ P, and whose leaves are each associated with a label, i.e., malicious or non-malicious. [sent-55, score-0.447]

30 An IPtree can thus be interpreted as a classification function for the IP addresses I: an IP address i gets the label associated with its longest matching prefix in P . [sent-58, score-0.208]
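The longest-matching-prefix rule above is easy to make concrete. Below is a minimal Python sketch (not the authors' implementation) of classifying an IPv4 address with an IPtree; the dictionary representation keyed on (prefix value, prefix length), the function names, and the example labels are assumptions made for illustration.

```python
def ip_to_bits(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to a 32-bit integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def classify(iptree: dict, ip: str) -> str:
    """Return the label of the longest matching prefix of `ip` in `iptree`.

    `iptree` maps (prefix_value, prefix_length) to a label; the root (0, 0)
    is assumed to be present so every address matches something.
    """
    bits = ip_to_bits(ip)
    label = iptree[(0, 0)]                    # root always matches
    for length in range(1, 33):               # check /1 through /32
        prefix = bits >> (32 - length)        # top `length` bits of the address
        if (prefix, length) in iptree:
            label = iptree[(prefix, length)]  # a longer match overrides a shorter one
    return label

# Example: everything under 10.0.0.0/8 is malicious except 10.1.0.0/16.
tree = {(0, 0): "legitimate", (10, 8): "malicious", (10 * 256 + 1, 16): "legitimate"}
print(classify(tree, "10.2.3.4"))   # -> malicious
print(classify(tree, "10.1.3.4"))   # -> legitimate
```

A production implementation would walk a binary trie bit by bit instead of scanning all 32 prefix lengths, but the dictionary version keeps the example short.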

31 To make such a comparison meaningful, the offline tree must pay an additional penalty each time it changes (otherwise the offline tree would not be a meaningful point of comparison – it could change for each IP address in the sequence, and thus make no mistakes). [sent-100, score-0.417]

32 We therefore limit the kinds of changes the offline tree can make, and compare the performance of our algorithm to every IPtree with k leaves, as a function of the errors it makes and the changes it makes. [sent-101, score-0.219]

33 We define an adaptive IPtree of size k to be an adaptive tree that can (a) grow nodes over time so long as it never has more than k leaves, (b) change the labels of its leaf nodes, and (c) occasionally reconfigure itself completely. [sent-102, score-0.359]

34 At a high-level, our approach keeps a number of experts in each prefix of the IPtree, and combines their predictions to classify every IP address. [sent-106, score-0.205]

35 To update the tree, the algorithm rewards the path-nodes that predicted correctly, penalizes the incorrect ones, and modifies the tree structure if necessary. [sent-111, score-0.196]

36 We formulate this as a sleeping experts problem [4, 10]. [sent-128, score-0.262]

37 We set an expert in each node, and call them the path-node experts, and for an IP i, we consider the set of path-node experts in Pi,T to be the “awake” experts, and the rest to be “asleep”. [sent-129, score-0.243]

38 In our context, the best awake expert on the IP i corresponds to the prefix of i in the optimal IPtree, which remains sleeping until the IPtree grows that prefix. [sent-131, score-0.224]

39 Fig. 2(a) illustrates the sleeping experts framework in our context: the shaded nodes are “awake” and the rest are “asleep”. [sent-133, score-0.311]

40 Specifically, let xt denote the weight of the path-node expert at node t, and let Si,T = Σ_{t ∈ Pi,T} xt. [sent-134, score-0.192]

41 To predict on IP address i, the algorithm chooses the expert at node t with probability xt /Si,T . [sent-135, score-0.313]

42 To update, the algorithm penalizes all incorrect experts in Pi,T , reducing their weight xt to γxt . [sent-136, score-0.229]

43 It then renormalizes the weights of all the experts in Pi,T so that their sum Si,T does not change. [sent-141, score-0.166]

44 (In our proof, we use a slightly different version of the sleeping experts algorithm [4]). [sent-142, score-0.282]
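The following Python sketch illustrates this weighting step on a single path, assuming per-node predictions are available (they come from the node-label experts described next). It is an illustration of the mechanism, under the assumed dictionary representation, and not the exact sleeping-experts variant from [4] used in the proof.

```python
import random

def predict_on_path(x, node_pred, path, rng=random):
    """Sample a path-node expert with probability proportional to its weight x[t]."""
    total = sum(x[t] for t in path)
    r, acc = rng.random() * total, 0.0
    for t in path:
        acc += x[t]
        if r <= acc:
            return node_pred[t]
    return node_pred[path[-1]]

def update_on_path(x, node_pred, path, true_label, gamma):
    """Penalize wrong awake experts by gamma, then rescale so the path sum
    S_{i,T} is unchanged."""
    s_before = sum(x[t] for t in path)
    for t in path:
        if node_pred[t] != true_label:
            x[t] *= gamma
    s_after = sum(x[t] for t in path)
    if s_after > 0:
        scale = s_before / s_after
        for t in path:
            x[t] *= scale

# Example with a 3-node path (root, /8 node, /16 node):
x = {"root": 1.0, "/8": 1.0, "/16": 1.0}
pred = {"root": -1, "/8": +1, "/16": +1}
update_on_path(x, pred, ["root", "/8", "/16"], true_label=+1, gamma=0.5)
print(x)  # the root's weight shrinks, the others grow, the path sum stays 3.0
```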

45 Deciding Labels of Individual Nodes Next, we need to decide whether the path-node expert at a node n should predict positive or negative. [sent-143, score-0.186]

46 We use a different experts algorithm to address this subproblem – the shifting experts algorithm [12]. [sent-144, score-0.485]

47 Specifically, we allow each node n to have two additional experts – a positive expert, which always predicts positive, and a negative expert, which always predicts negative. [sent-145, score-0.237]

48 Let yn,+ and yn,− denote the weights of the positive and negative node-label experts respectively, with yn,− + yn,+ = 1. [sent-147, score-0.166]

49 To update, when the node receives a label, it increases the weight of the correct node-label expert by ǫ, and decreases the weight of the incorrect node-label expert by ǫ (up to a maximum of 1 and a minimum of 0). [sent-149, score-0.246]

50 Note that this algorithm naturally adapts when a leaf of the optimal IPtree switches labels – the relevant node in our IPtree will slowly shift weights from the incorrect node-label expert to the correct one, making an expected 1/ǫ mistakes in the process. [sent-150, score-0.375]
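A minimal sketch of this node-label update follows, tracking only y_pos (with y_neg = 1 − y_pos, since the two move symmetrically except at the clipping boundaries). The deterministic pick-the-heavier-expert prediction rule is an assumption made for illustration, not necessarily the exact rule used in the paper.

```python
def node_label_predict(y_pos):
    """Predict +1 (malicious) if the positive node-label expert carries more weight."""
    return +1 if y_pos >= 0.5 else -1

def node_label_update(y_pos, true_label, eps):
    """Shift eps of weight toward the correct node-label expert, clipped to [0, 1]."""
    if true_label == +1:
        return min(1.0, y_pos + eps)
    return max(0.0, y_pos - eps)

# Example: starting from y_pos = 0 and repeatedly seeing label +1, the node's
# prediction flips to +1 after a number of updates on the order of 1/eps.
y = 0.0
for _ in range(20):               # 20 = 1/eps updates with eps = 0.05
    y = node_label_update(y, +1, 0.05)
print(node_label_predict(y))      # -> 1
```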

51 Fig. 2(b) illustrates the shifting experts setting on an IPtree: each node has two experts, a positive and a negative. [sent-152, score-0.237]

52 Fig. 3 shows how it fits in with the sleeping experts algorithm. [sent-154, score-0.262]

53 Every time a leaf makes 1/ǫ mistakes, TrackIPTree splits that leaf into its children, and instantiates and initializes the relevant path-node experts and node-label experts of the children. [sent-159, score-0.469]

54 In effect, it is as if the path-node experts of the children had been asleep till this point, but will now be “awake” for the appropriate IP addresses. [sent-160, score-0.219]

55 TrackIPTree waits for 1/ǫ mistakes at each node before growing it, so that there is a little resilience to noisy data – otherwise, it would split a node every time the optimal tree made a mistake, and the IPtree would grow very quickly. [sent-161, score-0.424]

56 Note also that it naturally incorporates the optimal IPtree growing a leaf; our tree will grow the appropriate nodes when that leaf has made 1/ǫ mistakes. [sent-162, score-0.286]
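A hedged sketch of this leaf-splitting rule is shown below: a leaf is split into its two children only after it has made roughly 1/ǫ mistakes. The (value, length) prefix encoding and the expert-initialization callback are assumptions made for this example, not the authors' data structures.

```python
def child_prefixes(node):
    """Children of prefix (value, length) in the binary IP address tree."""
    value, length = node
    return (value << 1, length + 1), ((value << 1) | 1, length + 1)

def record_mistake(tree, mistakes, leaf, eps, init_node=lambda n: None):
    """Count a mistake at `leaf`; once the count reaches 1/eps, grow its children."""
    mistakes[leaf] = mistakes.get(leaf, 0) + 1
    if mistakes[leaf] >= 1.0 / eps:
        mistakes[leaf] = 0
        for child in child_prefixes(leaf):
            tree.add(child)     # the child's path-node expert "wakes up" here
            init_node(child)    # initialize its path-node and node-label experts
```

Waiting for 1/ǫ mistakes before splitting is exactly the noise-resilience trade-off described above: a single noisy label never forces the tree to grow.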

57 While this excessive splitting does not impact the predictions of the path-node experts or the node-label experts significantly, we still need to ensure that the IPtree built by our algorithm does not become too large. [sent-164, score-0.352]

58 (Algorithm fragment:) mistakes[t] := 0; sub GROWTREE, Input: leaf l — if size(T) ≥ m, select nodes N to discard with the paging algorithm; split leaf l into children lc, rc. [sent-167, score-0.323]

59 The IPtree built by our algorithm may have at most m leaves (and thus, 2m nodes, since it is a binary tree), and so the size of its cache is 2m and the offline cache is 2k. [sent-170, score-0.2]

60 We may then select nodes to be discarded as if they were pages in the cache once the IPtree grows beyond 2m nodes; so, for example, we may choose the least recently used nodes in the IPtree, with LRU as the paging algorithm. [sent-171, score-0.258]

61 Our analysis shows that setting m = O((k/ǫ²) log(k/ǫ)) suffices, when TrackIPTree uses FLUSH-WHEN-FULL (FWF) as its paging algorithm – this is a simple paging algorithm that discards all the pages in the cache when the cache is full, and restarts with an empty cache. [sent-172, score-0.36]
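The paging step itself is small. The sketch below shows both policies mentioned here: FWF (analyzed) and an LRU variant (used in the experiments). The set-of-prefixes tree, the OrderedDict used as LRU bookkeeping, and the function names are assumptions for illustration only.

```python
from collections import OrderedDict

def touch(last_used, node):
    """Mark a node as recently used (move it to the back of the LRU order)."""
    last_used.pop(node, None)
    last_used[node] = True

def make_room(tree, last_used, max_nodes, policy="LRU"):
    """Discard nodes once the tree reaches its budget of max_nodes."""
    root = (0, 0)
    if len(tree) < max_nodes:
        return
    if policy == "FWF":                        # flush-when-full: drop everything but the root
        tree.clear()
        tree.add(root)
        last_used.clear()
        return
    while len(tree) >= max_nodes and last_used:  # LRU: evict the oldest non-root nodes
        victim, _ = last_used.popitem(last=False)
        if victim != root:
            tree.discard(victim)
```

In this sketch `make_room` would be called just before a leaf is split (as in the GROWTREE fragment above), so the tree never exceeds its space budget.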

62 We show our algorithm performs nearly as well as the best adaptive k-IPtree, bounding the number of mistakes made by our algorithm as a function of the number of mistakes, number of label changes, and number of complete reconfigurations of the optimal such tree in hindsight. [sent-177, score-0.369]

63 5 Data We focus on IP addresses derived from mail data, since spammers represent a significant fraction of the malicious activity and compromised hosts on the Internet [6], and labels are relatively easy to obtain from spam-filtering run by the mail servers. [sent-189, score-0.801]

64 For our evaluation, we consider labels from the mail servers’ spam-filtering to be ground truth. [sent-190, score-0.196]

65 One data set consists of log extracts collected at the mail servers of a tier-1 ISP with 1 million active mailboxes. [sent-192, score-0.325]

66 The extracts contain the IP addresses of the mail servers that send mail to the ISP, the number of messages they sent, and the fraction of those messages that are classified as spam, aggregated over 10 minute intervals. [sent-193, score-0.639]

67 The mail server’s spam-filtering software consists of a combination of hand-crafted rules, DNS blacklists, and Brightmail [1], and we take their results as labels for our experiments. [sent-194, score-0.196]

68 The log extracts were collected over 38 days from December 2008 to January 2009, and contain 126 million IP addresses, of which 105 million are spam and 21 million are legitimate. [sent-195, score-0.402]

69 The second data set consists of log extracts from the enterprise mail server of a large corporation with 1300 active mailboxes. [sent-196, score-0.312]

70 These extracts also contain the IP addresses of mail servers that attempted to send mail, along with the number of messages they sent and the fraction of these messages that were classified spam by SpamAssassin [2], aggregated over 10 minute intervals. [sent-197, score-0.621]

71 Note that in both cases, our data only contains aggregate information about the IP addresses of the mail servers sending mail to the ISP and enterprise mail servers, and so we do not have the ability to map any information back to individual users of the ISP or enterprise mail servers. [sent-200, score-1.037]

72 TrackIPTree For the experimental results, we use LRU as the paging algorithm when nodes need to be discarded from the IPtree (Sec. [sent-201, score-0.186]

73 05 and the penalty factor γ for sleeping experts is set to 0. [sent-206, score-0.262]

74 That is, for each day, we assign labels so that the fixed clusters maximize their accuracy on spam for a given required accuracy on legitimate mail. [sent-220, score-0.617]
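One way to carry out this per-day label assignment for the fixed clusters is a simple greedy heuristic over per-cluster (spam, legitimate) message counts, sketched below. The greedy rule is an assumption made for illustration, not necessarily the paper's exact procedure.

```python
def label_fixed_clusters(clusters, required_legit_accuracy):
    """clusters: list of (n_spam, n_legit) message counts per fixed cluster.
    Returns the set of cluster indices labeled 'spam'."""
    total_legit = sum(n_legit for _, n_legit in clusters)
    budget = (1.0 - required_legit_accuracy) * total_legit  # legit mail we may misclassify
    # Consider clusters in decreasing order of spam-to-legitimate ratio.
    order = sorted(range(len(clusters)),
                   key=lambda i: clusters[i][0] / (clusters[i][1] + 1.0),
                   reverse=True)
    spam_labeled, legit_misclassified = set(), 0
    for i in order:
        n_spam, n_legit = clusters[i]
        if legit_misclassified + n_legit <= budget:
            spam_labeled.add(i)
            legit_misclassified += n_legit
    return spam_labeled

# Example: three clusters; require 95% accuracy on 200 legitimate messages.
print(label_fixed_clusters([(900, 5), (50, 180), (300, 15)], 0.95))  # -> {0}
```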

75 Clearly, the accuracy of TrackIPTree is a tremendous improvement on both sets of a priori fixed clusters – for any choice of coverage on legitimate IPs, the accuracy on spam IPs achieved by TrackIPTree is far higher than that of the a priori fixed clusters, even by as much as a factor of 2. [sent-263, score-0.616]

76 In particular, note that when the coverage required on legitimate IPs is 95%, TrackIPTree achieves 95% accuracy in classifying spam on both data sets, compared to the 35 − 45% achieved by the other clusters. [sent-265, score-0.379]

77 Table 1 shows the median number of leaves instantiated by the tree at the end of each day. [sent-267, score-0.249]

78 These numbers highlight that the a priori fixed clusters are perhaps too coarse to classify accurately in parts of the IP address space, and also are insufficiently aggregated in other parts of the address space. [sent-271, score-0.339]

79 Figs. 4(c) & 4(d) show the accuracy-coverage tradeoff for TrackIPTree when m ranges between 20,000-200,000 leaves for the ISP logs, and 1,000-50,000 leaves for the enterprise logs. [sent-274, score-0.284]

80 The static clustering that we compare to is a tree generated by our algorithm, but one that learns over the first z days, and then stays unchanged. [sent-279, score-0.186]

81 We compare these trees by examining separately the errors incurred on legitimate and spam IPs. [sent-281, score-0.293]

82 Figs. 4(e) & 4(f) compare the errors of the z-static trees and the dynamic tree on legitimate and spam IPs respectively, using the ISP logs. [sent-287, score-0.49]

83 Clearly, both z-static trees degrade in accuracy over time, and they do so on both legitimate and spam IPs. [sent-288, score-0.328]

84 On the other hand, the accuracy of the dynamic tree does not degrade over this period. [sent-289, score-0.232]

85 Further, the gap in error grows with time; after 28 days, the 10-static tree has almost a factor of 2 higher error on both spam IPs and legitimate IPs. [sent-290, score-0.448]

86 Discussion and Implications Our experiments demonstrate that our algorithm is able to achieve high accuracy in predicting legitimate and spam IPs, e. [sent-291, score-0.348]

87 , it can predict 95% of the spam IPs correctly, when misclassifying only 5% of the legitimate IPs. [sent-293, score-0.331]

88 Nevertheless, our IPtree still provides insight into the malicious activity on the Internet. [sent-296, score-0.301]

89 The nodes are coloured depending on their weights, as shown in Table 2: for node t, define wt = Σ_{j ∈ Q} xj (yj,+ − yj,−), where Q is the set of prefixes of node t (including node t itself). [sent-301, score-0.262]

90 (e.g., /8 prefixes), and the classification they output is slightly malicious; this means that an IP address without a longer matching prefix in the tree is typically classified as malicious. [sent-304, score-0.24]

91 This might happen, for instance, if active IP addresses for this application are not distributed uniformly in the address space (and so all prefixes do not need to be grown at uniform rates), which is also what we might expect to see based on prior work [16]. [sent-307, score-0.186]

92 Nevertheless, these observations suggest that our tree does indeed capture an appropriate picture of the malicious activity on the Internet. [sent-309, score-0.456]

93 6 Other Related Work In the networking and databases literature, there has been much interest in designing streaming algorithms to identify IP prefixes with significant network traffic [7, 9, 27], but these algorithms do not explore how to predict malicious activity. [sent-310, score-0.324]

94 Previous IP-based approaches to reduce spam traffic [22, 24], as mentioned earlier, have also explored individual IP addresses, which are not particularly useful since they are so dynamic [15, 19, 25]. [sent-311, score-0.195]

95 Zhang et al [26] also examine how to predict whether known malicious IP addresses may appear at a given network, by analyzing the co-occurrence of all known malicious IP addresses at a number of different networks. [sent-312, score-0.77]

96 7 Conclusion We have addressed the problem of discovering dynamic malicious regions on the Internet. [sent-315, score-0.329]

97 We model this problem as one of adaptively pruning a known decision tree, but with the additional challenges coming from real-world settings – severe space requirements and a changing target function. [sent-316, score-0.171]

98 We developed new algorithms to address this problem, by combining “experts” algorithms and online paging algorithms. [sent-317, score-0.234]

99 Efficient and effective decision tree construction on streaming data. [sent-397, score-0.209]

100 Email prioritization: Reducing delays on legitimate mail caused by junk mail. [sent-448, score-0.311]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('iptree', 0.448), ('ip', 0.407), ('trackiptree', 0.331), ('malicious', 0.265), ('pre', 0.19), ('ips', 0.174), ('mail', 0.171), ('experts', 0.166), ('tree', 0.155), ('spam', 0.153), ('legitimate', 0.14), ('isp', 0.139), ('paging', 0.117), ('xes', 0.116), ('traf', 0.107), ('mistakes', 0.103), ('addresses', 0.101), ('enterprise', 0.096), ('sleeping', 0.096), ('leaves', 0.094), ('address', 0.085), ('expert', 0.077), ('legit', 0.075), ('apriori', 0.072), ('node', 0.071), ('pruning', 0.064), ('logs', 0.064), ('expt', 0.064), ('sigcomm', 0.064), ('servers', 0.06), ('leaf', 0.058), ('clusters', 0.058), ('days', 0.057), ('ine', 0.056), ('awake', 0.051), ('coverage', 0.051), ('nodes', 0.049), ('million', 0.049), ('extracts', 0.045), ('cache', 0.043), ('dynamic', 0.042), ('classify', 0.039), ('predict', 0.038), ('activity', 0.036), ('internet', 0.036), ('accuracy', 0.035), ('messages', 0.034), ('decision', 0.033), ('day', 0.032), ('online', 0.032), ('asleep', 0.032), ('chapire', 0.032), ('dns', 0.032), ('hosts', 0.032), ('initializenode', 0.032), ('lru', 0.032), ('patscheck', 0.032), ('recon', 0.032), ('usenix', 0.032), ('static', 0.031), ('adaptively', 0.031), ('yn', 0.031), ('attacks', 0.028), ('subproblem', 0.028), ('acm', 0.027), ('xn', 0.026), ('labels', 0.025), ('grow', 0.024), ('ltering', 0.024), ('attack', 0.024), ('adaptive', 0.024), ('severe', 0.023), ('measurement', 0.023), ('ree', 0.023), ('minute', 0.023), ('subproblems', 0.023), ('xt', 0.022), ('regions', 0.022), ('changes', 0.022), ('allowed', 0.022), ('label', 0.022), ('security', 0.022), ('children', 0.021), ('splits', 0.021), ('blacklists', 0.021), ('elmbold', 0.021), ('enkataraman', 0.021), ('hitters', 0.021), ('imc', 0.021), ('prefixes', 0.021), ('reund', 0.021), ('rishnamurthy', 0.021), ('rval', 0.021), ('ung', 0.021), ('incorrect', 0.021), ('streaming', 0.021), ('changing', 0.02), ('tracking', 0.02), ('algorithm', 0.02), ('modest', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

2 0.081846595 182 nips-2009-Optimal Scoring for Unsupervised Learning

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

3 0.066285461 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

Author: Daniel Zoran, Yair Weiss

Abstract: We propose a new model for natural image statistics. Instead of minimizing dependency between components of natural images, we maximize a simple form of dependency in the form of tree-dependencies. By learning filters and tree structures which are best suited for natural images we observe that the resulting filters are edge filters, similar to the famous ICA on natural images results. Calculating the likelihood of an image patch using our model requires estimating the squared output of pairs of filters connected in the tree. We observe that after learning, these pairs of filters are predominantly of similar orientations but different phases, so their joint energy resembles models of complex cells. 1 Introduction and related work Many models of natural image statistics have been proposed in recent years [1, 2, 3, 4]. A common goal of many of these models is finding a representation in which components or sub-components of the image are made as independent or as sparse as possible [5, 6, 2]. This has been found to be a difficult goal, as natural images have a highly intricate structure and removing dependencies between components is hard [7]. In this work we take a different approach, instead of minimizing dependence between components we try to maximize a simple form of dependence - tree dependence. It would be useful to place this model in context of previous works about natural image statistics. Many earlier models are described by the marginal statistics solely, obtaining a factorial form of the likelihood: p(x) = pi (xi ) (1) i The most notable model of this approach is Independent Component Analysis (ICA), where one seeks to find a linear transformation which maximizes independence between components (thus fitting well with the aforementioned factorization). This model has been applied to many scenarios, and proved to be one of the great successes of natural image statistics modeling with the emergence of edge-filters [5]. This approach has two problems. The first is that dependencies between components are still very strong, even with those learned transformation seeking to remove them. Second, it has been shown that ICA achieves, after the learned transformation, only marginal gains when measured quantitatively against simpler method like PCA [7] in terms of redundancy reduction. A different approach was taken recently in the form of radial Gaussianization [8], in which components which are distributed in a radially symmetric manner are made independent by transforming them non-linearly into a radial Gaussian, and thus, independent from one another. A more elaborate approach, related to ICA, is Independent Subspace Component Analysis or ISA. In this model, one looks for independent subspaces of the data, while allowing the sub-components 1 Figure 1: Our model with respect to marginal models such as ICA (a), and ISA like models (b). Our model, being a tree based model (c), allows components to belong to more than one subspace, and the subspaces are not required to be independent. of each subspace to be dependent: p(x) = pk (xi∈K ) (2) k This model has been applied to natural images as well and has been shown to produce the emergence of phase invariant edge detectors, akin to complex cells in V1 [2]. Independent models have several shortcoming, but by far the most notable one is the fact that the resulting components are, in fact, highly dependent. First, dependency between the responses of ICA filters has been reported many times [2, 7]. 
Also, dependencies between ISA components has also been observed [9]. Given these robust dependencies between filter outputs, it is somewhat peculiar that in order to get simple cell properties one needs to assume independence. In this work we ask whether it is possible to obtain V1 like filters in a model that assumes dependence. In our model we assume the filter distribution can be described by a tree graphical model [10] (see Figure 1). Degenerate cases of tree graphical models include ICA (in which no edges are present) and ISA (in which edges are only present within a subspace). But in its non-degenerate form, our model assumes any two filter outputs may be dependent. We allow components to belong to more than one subspace, and as a result, do not require independence between them. 2 Model and learning Our model is comprised of three main components. Given a set of patches, we look for the parameters which maximize the likelihood of a whitened natural image patch z: N p(yi |ypai ; β) p(z; W, β, T ) = p(y1 ) (3) i=1 Where y = Wz, T is the tree structure, pai denotes the parent of node i and β is a parameter of the density model (see below for the details). The three components we are trying to learn are: 1. The filter matrix W, where every row defines one of the filters. The response of these filters is assumed to be tree-dependent. We assume that W is orthogonal (and is a rotation of a whitening transform). 2. The tree structure T which specifies which components are dependent on each other. 3. The probability density function for connected nodes in the tree, which specify the exact form of dependency between nodes. All three together describe a complete model for whitened natural image patches, allowing likelihood estimation and exact inference [11]. We perform the learning in an iterative manner: we start by learning the tree structure and density model from the entire data set, then, keeping the structure and density constant, we learn the filters via gradient ascent in mini-batches. Going back to the tree structure we repeat the process many times iteratively. It is important to note that both the filter set and tree structure are learned from the data, and are continuously updated during learning. In the following sections we will provide details on the specifics of each part of the model. 2 β=0.0 β=0.5 β=1.0 β=0.0 β=0.5 β=1.0 2 1 1 1 1 1 1 0 −1 0 −1 −2 −1 −2 −3 0 x1 2 0 x1 2 0 x1 2 −2 −3 −2 0 x1 2 0 −1 −2 −3 −2 0 −1 −2 −3 −2 0 −1 −2 −3 −2 0 x2 3 2 x2 3 2 x2 3 2 x2 3 2 x2 3 2 x2 3 −3 −2 0 x1 2 −2 0 x1 2 Figure 2: Shape of the conditional (Left three plots) and joint (Right three plots) density model in log scale for several values of β, from dependence to independence. 2.1 Learning tree structure In their seminal paper, Chow and Liu showed how to learn the optimal tree structure approximation for a multidimensional probability density function [12]. This algorithm is easy to apply to this scenario, and requires just a few simple steps. First, given the current estimate for the filter matrix W, we calculate the response of each of the filters with all the patches in the data set. Using these responses, we calculate the mutual information between each pair of filters (nodes) to obtain a fully connected weighted graph. The final step is to find a maximal spanning tree over this graph. The resulting unrooted tree is the optimal tree approximation of the joint distribution function over all nodes. 
We will note that the tree is unrooted, and the root can be chosen arbitrarily - this means that there is no node, or filter, that is more important than the others - the direction in the tree graph is arbitrary as long as it is chosen in a consistent way. 2.2 Joint probability density functions Gabor filter responses on natural images exhibit highly kurtotic marginal distributions, with heavy tails and sharp peaks [13, 3, 14]. Joint pair wise distributions also exhibit this same shape with varying degrees of dependency between the components [13, 2]. The density model we use allows us to capture both the highly kurtotic nature of the distributions, while still allowing varying degrees of dependence using a mixing variable. We use a mix of two forms of finite, zero mean Gaussian Scale Mixtures (GSM). In one, the components are assumed to be independent of each other and in the other, they are assumed to be spherically distributed. The mixing variable linearly interpolates between the two, allowing us to capture the whole range of dependencies: p(x1 , x2 ; β) = βpdep (x1 , x2 ) + (1 − β)pind (x1 , x2 ) (4) When β = 1 the two components are dependent (unless p is Gaussian), whereas when β = 0 the two components are independent. For the density functions themselves, we use a finite GSM. The dependent case is a scale mixture of bivariate Gaussians: 2 πk N (x1 , x2 ; σk I) pdep (x1 , x2 ) = (5) k While the independent case is a product of two independent univariate Gaussians: 2 πk N (x1 ; σk ) pind (x1 , x2 ) = k 2 πk N (x2 ; σk ) (6) k 2 Estimating parameters πk and σk for the GSM is done directly from the data using Expectation Maximization. These parameters are the same for all edges and are estimated only once on the first iteration. See Figure 2 for a visualization of the conditional distribution functions for varying values of β. We will note that the marginal distributions for the two types of joint distributions above are the same. The mixing parameter β is also estimated using EM, but this is done for each edge in the tree separately, thus allowing our model to theoretically capture the fully independent case (ICA) and other degenerate models such as ISA. 2.3 Learning tree dependent components Given the current tree structure and density model, we can now learn the matrix W via gradient ascent on the log likelihood of the model. All learning is performed on whitened, dimensionally 3 reduced patches. This means that W is a N × N rotation (orthonormal) matrix, where N is the number of dimensions after dimensionality reduction (see details below). Given an image patch z we multiply it by W to get the response vector y: y = Wz (7) Now we can calculate the log likelihood of the given patch using the tree model (which we assume is constant at the moment): N log p(yi |ypai ) log p(y) = log p(yroot ) + (8) i=1 Where pai denotes the parent of node i. Now, taking the derivative w.r.t the r-th row of W: ∂ log p(y) ∂ log p(y) T = z ∂Wr ∂yr (9) Where z is the whitened natural image patch. Finally, we can calculate the derivative of the log likelihood with respect to the r-th element in y: ∂ log p(ypar , yr ) ∂ log p(y) = + ∂yr ∂yr c∈C(r) ∂ log p(yr , yc ) ∂ log p(yr ) − ∂yr ∂yr (10) Where C(r) denote the children of node r. In summary, the gradient ascent rule for updating the rotation matrix W is given by: t+1 t Wr = Wr + η ∂ log p(y) T z ∂yr (11) Where η is the learning rate constant. After update, the rows of W are orthonormalized. 
This gradient ascent rule is applied for several hundreds of patches (see details below), after which the tree structure is learned again as described in Section 2.1, using the new filter matrix W, repeating this process for many iterations. 3 Results and analysis 3.1 Validation Before running the full algorithm on natural image data, we wanted to validate that it does produce sensible results with simple synthetic data. We generated noise from four different models, one is 1/f independent Gaussian noise with 8 Discrete Cosine Transform (DCT) filters, the second is a simple ICA model with 8 DCT filters, and highly kurtotic marginals. The third was a simple ISA model - 4 subspaces, each with two filters from the DCT filter set. Distribution within the subspace was a circular, highly kurtotic GSM, and the subspaces were sampled independently. Finally, we generated data from a simple synthetic tree of DCT filters, using the same joint distributions as for the ISA model. These four synthetic random data sets were given to the algorithm - results can be seen in Figure 3 for the ICA, ISA and tree samples. In all cases the model learned the filters and distribution correctly, reproducing both the filters (up to rotations within the subspace in ISA) and the dependency structure between the different filters. In the case of 1/f Gaussian noise, any whitening transformation is equally likely and any value of beta is equally likely. Thus in this case, the algorithm cannot find the tree or the filters. 3.2 Learning from natural image patches We then ran experiments with a set of natural images [9]1 . These images contain natural scenes such as mountains, fields and lakes. . The data set was 50,000 patches, each 16 × 16 pixels large. The patches’ DC was removed and they were then whitened using PCA. Dimension was reduced from 256 to 128 dimensions. The GSM for the density model had 16 components. Several initial 1 available at http://www.cis.hut.fi/projects/ica/imageica/ 4 Figure 3: Validation of the algorithm. Noise was generated from three models - top row is ICA, middle row is ISA and bottom row is a tree model. Samples were then given to the algorithm. On the right are the resulting learned tree models. Presented are the learned filters, tree model (with white edges meaning β = 0, black meaning β = 1 and grays intermediate values) and an example of a marginal histogram for one of the filters. It can be seen that in all cases all parts of the model were correctly learned. Filters in the ISA case were learned up to rotation within the subspace, and all filters were learned up to sign. β values for the ICA case were always below 0.1, as were the values of β between subspaces in ISA. conditions for the matrix W were tried out (random rotations, identity) but this had little effect on results. Mini-batches of 10 patches each were used for the gradient ascent - the gradient of 10 patches was summed, and then normalized to have unit norm. The learning rate constant η was set to 0.1. Tree structure learning and estimation of the mixing variable β were done every 500 mini-batches. All in all, 50 iterations were done over the data set. 3.3 Filters and tree structure Figures 4 and 5 show the learned filters (WQ where Q is the whitening matrix) and tree structure (T ) learned from natural images. Unlike the ISA toy data in figure 3, here a full tree was learned and β is approximately one for all edges. The GSM that was learned for the marginals was highly kurtotic. 
It can be seen that resulting filters are edge filters at varying scales, positions and orientations. This is similar to the result one gets when applying ICA to natural images [5, 15]. More interesting is Figure 4: Left: Filter set learned from 16 × 16 natural image patches. Filters are ordered by PCA eigenvalues, largest to smallest. Resulting filters are edge filters having different orientations, positions, frequencies and phases. Right: The “feature” set learned, that is, columns of the pseudo inverse of the filter set. 5 Figure 5: The learned tree graph structure and feature set. It can be seen that neighboring features on the graph have similar orientation, position and frequency. See Figure 4 for a better view of the feature details, and see text for full detail and analysis. Note that the figure is rotated CW. 6 Optimal Orientation Optimal Frequency 3.5 Optimal Phase 7 3 0.8 6 2.5 Optimal Position Y 0.9 6 5 5 0.7 0.6 3 Child 1.5 4 Child 2 Child Child 4 3 2 1 1 0.4 0.3 2 0.5 0.5 0.2 0 1 0.1 0 0 1 2 Parent 3 4 0 0 2 4 Parent 6 8 0 0 1 2 3 Parent 4 5 6 0 0.2 0.4 0.6 Parent 0.8 1 Figure 6: Correlation of optimal parameters in neighboring nodes in the tree graph. Orientation, frequency and position are highly correlated, while phase seems to be entirely uncorrelated. This property of correlation in frequency and orientation, while having no correlation in phase is related to the ubiquitous energy model of complex cells in V1. See text for further details. Figure 7: Left: Comparison of log likelihood values of our model with PCA, ICA and ISA. Our model gives the highest likelihood. Right: Samples taken at random from ICA, ISA and our model. Samples from our model appear to contain more long-range structure. the tree graph structure learned along with the filters which is shown in Figure 5. It can be seen that neighboring filters (nodes) in the tree tend to have similar position, frequency and orientation. Figure 6 shows the correlation of optimal frequency, orientation and position for neighboring filters in the tree - it is obvious that all three are highly correlated. Also apparent in this figure is the fact that the optimal phase for neighboring filters has no significant correlation. It has been suggested that filters which have the same orientation, frequency and position with different phase can be related to complex cells in V1 [2, 16]. 3.4 Comparison to other models Since our model is a generalization of both ICA and ISA we use it to learn both models. In order to learn ICA we used the exact same data set, but the tree had no edges and was not learned from the data (alternatively, we could have just set β = 0). For ISA we used a forest architecture of 2 node trees, setting β = 1 for all edges (which means a spherical symmetric distribution), no tree structure was learned. Both models produce edge filters similar to what we learn (and to those in [5, 15, 6]). The ISA model produces neighboring nodes with similar frequency and orientation, but different phase, as was reported in [2]. We also compare to a simple PCA whitening transform, using the same whitening transform and marginals as in the ICA case, but setting W = I. We compare the likelihood each model gives for a test set of natural image patches, different from the one that was used in training. There were 50,000 patches in the test set, and we calculate the mean log likelihood over the entire set. The table in Figure 7 shows the result - as can be seen, our model performs better in likelihood terms than both ICA and ISA. 
Using a tree model, as opposed to more complex graphical models, allows for easy sampling from the model. Figure 7 shows 20 random samples taken from our tree model along with samples from the ICA and ISA models. Note the elongated structures (e.g. in the bottom left sample) in the samples from the tree model, and compare to patches sampled from the ICA and ISA models. 7 40 40 30 30 20 20 10 1 10 0.8 0.6 0.4 0 0.2 0 0 1 2 3 Orientation 4 0 0 2 4 6 Frequency 8 0 2 4 Phase Figure 8: Left: Interpretation of the model. Given a patch, the response of all edge filters is computed (“simple cells”), then at each edge, the corresponding nodes are squared and summed to produce the response of the “complex cell” this edge represents. Both the response of complex cells and simple cells is summed to produce the likelihood of the patch. Right: Response of a “complex cell” in our model to changing phase, frequency and orientation. Response in the y-axis is the sum of squares of the two filters in this “complex cell”. Note that while the cell is selective to orientation and frequency, it is rather invariant to phase. 3.5 Tree models and complex cells One way to interpret the model is looking at the likelihood of a given patch under this model. For the case of β = 1 substituting Equation 4 into Equation 3 yields: 2 2 ρ( yi + ypai ) − ρ(|ypai |) log L(z) = (12) i 2 Where ρ(x) = log k πk N (x; σk ) . This form of likelihood has an interesting similarity to models of complex cells in V1 [2, 4]. In Figure 8 we draw a simple two-layer network that computes the likelihood. The first layer applies linear filters (“simple cells”) to the image patch, while the second layer sums the squared outputs of similarly oriented filters from the first layer, having different phases, which are connected in the tree (“complex cells”). Output is also dependent on the actual response of the “simple cell” layer. The likelihood here is maximized when both the response of the parent filter ypai and the child yi is zero, but, given that one filter has responded with a non-zero value, the likelihood is maximized when the other filter also fires (see the conditional density in Figure 2). Figure 8 also shows an example of the phase invariance which is present in the learned

4 0.061607409 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray

Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (ďŹ nding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientiďŹ c problem of N-body simulation. In this paper we show for the ďŹ rst time O(đ?‘ ) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1]. 1

5 0.054201689 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

6 0.050996624 31 nips-2009-An LP View of the M-best MAP problem

7 0.050867829 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

8 0.047918059 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

9 0.044929627 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

10 0.044525284 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

11 0.043150686 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

12 0.042628452 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

13 0.041412 71 nips-2009-Distribution-Calibrated Hierarchical Classification

14 0.041294884 122 nips-2009-Label Selection on Graphs

15 0.040632471 161 nips-2009-Nash Equilibria of Static Prediction Games

16 0.034173571 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

17 0.033690535 95 nips-2009-Fast subtree kernels on graphs

18 0.033664629 76 nips-2009-Efficient Learning using Forward-Backward Splitting

19 0.03301901 48 nips-2009-Bootstrapping from Game Tree Search

20 0.031962223 246 nips-2009-Time-Varying Dynamic Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.105), (1, 0.014), (2, 0.003), (3, -0.021), (4, 0.003), (5, 0.008), (6, -0.008), (7, 0.028), (8, -0.032), (9, -0.042), (10, -0.106), (11, -0.012), (12, 0.041), (13, 0.003), (14, 0.025), (15, -0.028), (16, 0.051), (17, 0.009), (18, 0.04), (19, 0.063), (20, 0.065), (21, -0.023), (22, 0.005), (23, 0.044), (24, 0.03), (25, 0.014), (26, 0.034), (27, 0.057), (28, 0.03), (29, -0.133), (30, -0.034), (31, -0.043), (32, -0.044), (33, 0.036), (34, 0.004), (35, -0.005), (36, 0.074), (37, 0.045), (38, 0.129), (39, -0.031), (40, 0.043), (41, -0.003), (42, -0.004), (43, -0.052), (44, -0.046), (45, -0.02), (46, -0.017), (47, -0.051), (48, 0.105), (49, 0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93523991 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

2 0.51085579 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray

Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (ďŹ nding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientiďŹ c problem of N-body simulation. In this paper we show for the ďŹ rst time O(đ?‘ ) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1]. 1

3 0.49679595 48 nips-2009-Bootstrapping from Game Tree Search

Author: Joel Veness, David Silver, Alan Blair, William W. Cohen

Abstract: In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel’s checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play. 1

4 0.47874942 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

5 0.44500273 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

Author: Chong Wang, David M. Blei

Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1

6 0.4403443 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

7 0.42265818 122 nips-2009-Label Selection on Graphs

8 0.40918425 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

9 0.398918 74 nips-2009-Efficient Bregman Range Search

10 0.39725095 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

11 0.39394781 71 nips-2009-Distribution-Calibrated Hierarchical Classification

12 0.39254948 234 nips-2009-Streaming k-means approximation

13 0.38665342 149 nips-2009-Maximin affinity learning of image segmentation

14 0.38082072 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

15 0.3767139 31 nips-2009-An LP View of the M-best MAP problem

16 0.3690396 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

17 0.3570984 95 nips-2009-Fast subtree kernels on graphs

18 0.34188128 182 nips-2009-Optimal Scoring for Unsupervised Learning

19 0.33521944 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

20 0.33371115 51 nips-2009-Clustering sequence sets for motif discovery


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.015), (24, 0.044), (25, 0.057), (35, 0.041), (36, 0.091), (39, 0.058), (58, 0.055), (61, 0.014), (71, 0.04), (72, 0.387), (81, 0.015), (86, 0.061), (91, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77083647 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

2 0.5576269 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We 2 also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

3 0.55047899 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1

4 0.49334559 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

5 0.38128254 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

6 0.38071504 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

7 0.37982395 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

8 0.37791565 112 nips-2009-Human Rademacher Complexity

9 0.37739441 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

10 0.37722498 97 nips-2009-Free energy score space

11 0.37710068 260 nips-2009-Zero-shot Learning with Semantic Output Codes

12 0.37601876 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

13 0.37577638 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

14 0.37558022 72 nips-2009-Distribution Matching for Transduction

15 0.37550804 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

16 0.37470883 157 nips-2009-Multi-Label Prediction via Compressed Sensing

17 0.37443286 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

18 0.37425515 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

19 0.37417841 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

20 0.3739447 70 nips-2009-Discriminative Network Models of Schizophrenia