nips nips2004 nips2004-54 knowledge-graph by maker-knowledge-mining

54 nips-2004-Distributed Information Regularization on Graphs

Source: pdf

Author: Adrian Corduneanu, Tommi S. Jaakkola

Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. [sent-5, score-0.457]

2 The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. [sent-6, score-0.595]

3 The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. [sent-7, score-0.454]

4 We analyze the properties of the algorithm and demonstrate its performance on document classiﬁcation tasks. [sent-8, score-0.133]

5 The basic intuition underlying these methods is that the labels should not change within clusters of points, where the deﬁnition of a cluster may vary from one method to another. [sent-10, score-0.156]

6 We provide here an alternative information theoretic criterion and associated algorithms for solving semi-supervised learning problems. [sent-11, score-0.079]

7 Our formulation, an extension of [5, 6], is based on the idea of minimizing the number of bits required to communicate labels for unlabeled points, and involves no parametric assumptions. [sent-12, score-0.287]

8 The communication scheme inherent to the approach is deﬁned in terms of regions, weighted sets of points, that are shared between the sender and the receiver. [sent-13, score-0.207]

9 The regions are important in capturing the topology over the points to be labeled, and, through the communication criterion, bias the labels to be the same within each region. [sent-14, score-0.522]

10 We start by deﬁning the communication game and the associated regularization problem, analyze properties of the regularizer, derive distributed algorithms for ﬁnding the unique solution to the regularization problem, and demonstrate the method on a document classiﬁcation task. [sent-15, score-0.857]

11 Figure 1: The topology imposed by the set of regions (squares) on unlabeled points (circles); diagram labels: R1 … Rm, P(R), P(x|R), Q(y|x), x1 … xn. 2 The communication problem. Let S = {x1 , . [sent-16, score-0.592]

12 , xn } be the set of unlabeled points and Y the set of possible labels. [sent-19, score-0.284]

13 We assume that target labels are available only for a small subset Sl ⊂ S of the unlabeled points. [sent-20, score-0.287]

14 The objective here is to ﬁnd a conditional distribution Q(y|x) over the labels at each unlabeled point x ∈ S. [sent-21, score-0.333]

15 The estimation is made possible by a regularization criterion over the conditionals which we deﬁne here through a communication problem. [sent-22, score-0.53]

16 The communication scheme relies on a set of regions R = {R1 , . [sent-23, score-0.305]

17 , Rm }, where each region R ∈ R is a subset of the unlabeled points S (cf. [sent-26, score-0.44]

18 The weights of points within each region are expressed in terms of a conditional distribution $P(x|R)$, $\sum_{x \in R} P(x|R) = 1$, and each region has an a priori probability $P(R)$. [sent-28, score-0.503]

19 (Note: in our overloaded notation “R” stands both for the set of points and its identity as a set). [sent-30, score-0.09]

20 The regions and the membership probabilities are set in an application speciﬁc manner. [sent-31, score-0.202]

21 For example, in a document classiﬁcation setting we might deﬁne regions as sets of documents containing each word. [sent-32, score-0.403]

22 The probabilities P(R) and P(x|R) could be subsequently derived from a word frequency representation of documents: if f(w|x) is the frequency of word w in document x, then for each pair of w and the corresponding region R we can set $P(R) = \sum_{x \in S} f(w|x)/n$ and $P(x|R) = f(w|x)/(n P(R))$. [sent-33, score-0.564]
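
The word-based region weights in sentence 22 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `word_regions` and the matrix `F` (documents by words, holding f(w|x)) are assumptions, and every word is assumed to occur in at least one document so that P(R) > 0.

```python
import numpy as np

def word_regions(F):
    """Derive region weights from a word-frequency matrix.

    F[x, w] = f(w|x), the frequency of word w in document x (hypothetical input).
    Returns P_R with shape (V,) and P_x_given_R with shape (V, n),
    one region per vocabulary word.
    """
    n = F.shape[0]
    # P(R) = sum_x f(w|x) / n for the region R identified with word w
    P_R = F.sum(axis=0) / n
    # P(x|R) = f(w|x) / (n P(R)); assumes every word occurs somewhere
    P_x_given_R = F.T / (n * P_R[:, None])
    return P_R, P_x_given_R
```

When each row of `F` sums to one, the weights P(R) sum to one, and each row of `P_x_given_R` is a distribution over the documents in that region.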

23 For any ﬁxed conditionals {Q(y|x)} we deﬁne the communication problem as follows. [sent-34, score-0.234]

24 The sender selects a region R ∈ R with probability P (R) and a point within the region according to P (x|R). [sent-35, score-0.441]

25 The label y is sampled from Q(y|x) and communicated to the receiver optimally using a coding scheme tailored to the region R (based on knowing P (x|R) and Q(y|x), x ∈ R). [sent-37, score-0.303]

26 The receiver has access to x, R, and the region speciﬁc coding scheme to reproduce y. [sent-38, score-0.303]

27 The rate of information needed to be sent to the receiver in this scheme is given by $J_c(Q; \mathcal{R}) = \sum_{R \in \mathcal{R}} P(R)\, I_R(x; y) = \sum_{R \in \mathcal{R}} P(R) \sum_{x \in R} \sum_{y \in Y} P(x|R)\, Q(y|x) \log \frac{Q(y|x)}{Q(y|R)}$ (1), where $Q(y|R) = \sum_{x \in R} P(x|R)\, Q(y|x)$ is the overall probability of y within the region. [sent-39, score-0.178]
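
As a rough illustration of the regularizer in sentence 27, the sketch below evaluates the sum of region-weighted mutual informations directly. The function name `info_regularizer`, the array layout, and the `eps` clipping used to avoid log(0) are assumptions made for this example; the result is in nats.

```python
import numpy as np

def info_regularizer(P_R, P_x_given_R, Q, eps=1e-12):
    """Evaluate J_c(Q; R) = sum_R P(R) I_R(x; y) in nats.

    P_R:          (m,)   prior over regions
    P_x_given_R:  (m, n) point weights inside each region (0 outside the region)
    Q:            (n, k) conditionals Q(y|x), one row per point
    """
    Q_R = P_x_given_R @ Q                      # Q(y|R) for every region, shape (m, k)
    log_Q = np.log(np.clip(Q, eps, None))      # clipping keeps 0 * log(0) equal to 0
    total = 0.0
    for r in range(len(P_R)):
        log_ratio = log_Q - np.log(np.clip(Q_R[r], eps, None))
        total += P_R[r] * np.sum(P_x_given_R[r][:, None] * Q * log_ratio)
    return total
```

If all points in every region share the same conditional, each log ratio is zero and the regularizer vanishes, which matches the stated bias toward constant labels within a region.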

28 3 The regularization problem We use Jc (Q; R) to regularize the conditionals. [sent-40, score-0.255]

29 This regularizer biases the conditional distributions to be constant within each region so as to minimize the communication cost IR (x; y). [sent-41, score-0.439]

30 Put another way, by introducing a region R we bias the points in the region to be labeled the same. [sent-42, score-0.581]

31 The proof follows immediately from the strict convexity of mutual information [7] and the fact that the two conditions guarantee that each Q(y|x) appears non-trivially in at least one mutual information term. [sent-45, score-0.082]

32 4 Regularizer and the number of labelings We consider here a simple setting where the labels are hard and binary, Q(y|x) ∈ {0, 1}, and seek to bound the number of possible binary labelings consistent with a cap on the regularizer. [sent-46, score-0.557]

33 We assume for simplicity that points in a region have uniform weights P (x|R). [sent-47, score-0.277]

34 Let N (I) be the number of labelings of S consistent with an upper bound I on the regularizer Jc (Q, R). [sent-48, score-0.419]

35 Proof Let f (R) be the fraction of positive samples in region R. [sent-51, score-0.226]

36 Because the labels are binary IR (x; y) is given by H(f (R)), where H is the entropy. [sent-52, score-0.165]
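
A short derivation may help to see why the mutual information reduces to a binary entropy here. It assumes, as in the surrounding sentences, hard labels $Q(y|x) \in \{0, 1\}$ and uniform weights $P(x|R) = 1/|R|$; the symbol $y_x$ for the label assigned to x is introduced only for this sketch.

```latex
\begin{align*}
Q(y{=}1 \mid R) &= \sum_{x \in R} P(x|R)\, Q(y{=}1 \mid x) = f(R), \\
I_R(x; y) &= \sum_{x \in R} \sum_{y} P(x|R)\, Q(y|x) \log \frac{Q(y|x)}{Q(y|R)}
           = -\sum_{x \in R} P(x|R) \log Q(y_x \mid R) \\
          &= -f(R) \log f(R) - \bigl(1 - f(R)\bigr) \log \bigl(1 - f(R)\bigr) = H\bigl(f(R)\bigr).
\end{align*}
```

The second equality uses the fact that, for hard labels, only the term with $y = y_x$ survives and $\log Q(y_x|x) = \log 1 = 0$.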

37 Since the binary entropy is concave and symmetric w. [sent-54, score-0.082]

38 We say that a region is mainly negative if the former condition holds, or mainly positive if the latter. [sent-59, score-0.505]

39 If two regions R1 and R2 overlap by a large amount, they must be mainly positive or mainly negative together. [sent-60, score-0.483]

40 Then regions in a connected component must be all mainly positive or mainly negative together. [sent-62, score-0.528]

41 We upper bound the number of labelings of the points spanned by a given connected component C, and subsequently combine the bounds. [sent-64, score-0.512]

42 Consider the case in which all regions in C are mainly negative. [sent-65, score-0.287]

43 For any subset C′ of C that still covers all the points spanned by C, $f(C) \le \frac{1}{|C|} \sum_{R \in C'} g_I(R)\,|R| \le \max_R g_I(R) \cdot \frac{\sum_{R \in C'} |R|}{|C|}$ (3). Thus $f(C) \le t(C) \max_R g_I(R)$, where $t(C) = \min_{C' \subseteq C,\ C' \text{ cover}} \frac{\sum_{R \in C'} |R|}{|C|}$ is the minimum average number of times a point in C is necessarily covered. [sent-66, score-0.12]

44 There are at most $2^{n f(R) \log_2(2/f(R))}$ labelings of a set of points of which at most $n f(R)$ are positive. [sent-67, score-0.264]

45 Thus the number of feasible labelings of the connected component C is upper bounded by $2^{1 + n\, t(C) \max_R g_I(R) \log_2\left(2 / (t(C) \max_R g_I(R))\right)}$, where the 1 in the exponent arises because C can be either mainly positive or mainly negative. [sent-69, score-0.55]

46 By cumulating the bounds over all connected components and upper bounding the entropy-like term with I/P (R) we achieve the stated result. [sent-70, score-0.093]

47 Note that t(R), the average number of times a point is covered by a minimal subcovering of R, normally does not scale with |R| and is a covering dependent constant. [sent-71, score-0.076]

48 5 Distributed propagation algorithm We introduce here a local propagation algorithm for minimizing J(Q; λ) that is both easy to implement and provably convergent. [sent-72, score-0.142]

49 The algorithm can be seen as a variant of the Blahut-Arimoto algorithm in rate-distortion theory [8], adapted to the more structured context here. [sent-73, score-0.085]

50 We can extend the regularizer over both {Q(y|x)} and {QR(y)} by defining $J_c(Q, Q_R; \mathcal{R}) = \sum_{R \in \mathcal{R}} P(R) \sum_{x \in R} \sum_{y \in Y} P(x|R)\, Q(y|x) \log \frac{Q(y|x)}{Q_R(y)}$ (6), so that $J_c(Q; \mathcal{R}) = \min_{\{Q_R(\cdot),\, R \in \mathcal{R}\}} J_c(Q, Q_R; \mathcal{R})$ recovers the original regularizer. [sent-75, score-0.223]

51 The local propagation algorithm follows from optimizing each Q(y|x) based on ﬁxed {QR (y)} and subsequently ﬁnding each QR (y) given ﬁxed {Q(y|x)}. [sent-76, score-0.16]

52 In other words, Q(y|x) is obtained by taking a weighted geometric average of the distributions associated with the regions, whereas QR (y) is (as before) a weighted arithmetic average of the conditionals within each region. [sent-78, score-0.219]
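
The alternating geometric and arithmetic averages described in sentence 52 can be sketched as below. This is an illustrative reading of the extracted sentences, not the paper's exact update: the function name `propagate`, the `labels` dictionary, the `eps` clipping, and the per-point normalization of the weights P(R)P(x|R) are assumptions, and labeled points are simply clamped to their observed label (the simpler update mentioned in sentence 57) rather than fitted with a likelihood term.

```python
import numpy as np

def propagate(P_R, P_x_given_R, labels, n_classes, n_iters=50, eps=1e-12):
    """Alternate region and point updates on the region/point bipartite graph.

    P_R:          (m,)   prior over regions
    P_x_given_R:  (m, n) point weights inside each region (0 outside the region)
    labels:       dict mapping a labeled point index to its class (hypothetical input)
    Every point is assumed to belong to at least one region.
    """
    n = P_x_given_R.shape[1]
    W = P_R[:, None] * P_x_given_R           # joint weight P(R) P(x|R), shape (m, n)
    W = W / W.sum(axis=0, keepdims=True)     # per-point weights over regions sum to 1
    one_hot = np.eye(n_classes)
    Q = np.full((n, n_classes), 1.0 / n_classes)
    for x, y in labels.items():              # clamp observed labels
        Q[x] = one_hot[y]
    for _ in range(n_iters):
        Q_R = P_x_given_R @ Q                # arithmetic average within each region
        # weighted geometric average of the region distributions at each point
        log_Q = W.T @ np.log(np.clip(Q_R, eps, None))
        log_Q -= log_Q.max(axis=1, keepdims=True)
        Q = np.exp(log_Q)
        Q /= Q.sum(axis=1, keepdims=True)
        for x, y in labels.items():          # re-clamp after each sweep
            Q[x] = one_hot[y]
    return Q
```

Both steps only touch neighbors in the bipartite graph of regions and points, which is what makes the scheme local: a region reads the conditionals of its member points, and a point reads the distributions of the regions that contain it.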

53 In terms of the document classiﬁcation example discussed earlier, the weight [nP (R)P (x|R)] appearing in the geometric average reduces to f (w|x), the frequency of word w identiﬁed with region R in document x. [sent-79, score-0.584]

55 While the objective is strictly convex, the solution cannot be written in closed form and has to be found iteratively (e. [sent-81, score-0.08]

56 , via Newton-Raphson or simple bracketing when the labels are binary). [sent-83, score-0.124]

57 A much simpler update Q(y|x) = δ(y, yx ), where yx is the observed label for x, may sufﬁce in practice. [sent-84, score-0.106]

58 1 Extensions Structured labels and generalized propagation steps Here we extend the regularization framework to the case where the labels represent more structured annotations of objects. [sent-87, score-0.697]

59 Let y be a vector of elementary labels y = [y1 , . [sent-88, score-0.165]

60 , yk |x), for any x, can be represented as a tree structured graphical model, where the structure is the same for all x ∈ S. [sent-95, score-0.162]

61 While the regularization principle applies directly if we leave Q(y|x) unconstrained, the calculations would be potentially infeasible due to the number of elementary labels involved, and inefﬁcient as we would not explicitly make use of the assumed structure. [sent-99, score-0.461]

63 The regularization problem will be formulated over {Qi (yi |x), Qij (yi , yj |x)} rather than unconstrained Q(y|x). [sent-101, score-0.384]

64 The difﬁculty in this case arises from the fact that the arithmetic average (mixing) in eq (8) is not structure preserving (tree structured models are not mean ﬂat). [sent-102, score-0.239]

65 By restricting the class of variational distributions QR (y) that we consider, we necessarily obtain an upper bound on the original information criterion. [sent-104, score-0.092]

66 The new updates will result in a monotonically decreasing bound on the original criterion. [sent-106, score-0.078]

67 Figure 2: Clusters correctly separated by information regularization given one label from each class. [sent-123, score-0.255]

68 6.2 Complementary sets of regions. In many cases the points to be labeled may have alternative feature representations, each leading to a different set of natural regions $\mathcal{R}^{(k)}$. [sent-124, score-0.537]

69 For example, in web page classiﬁcation both the content of the page, and the type of documents that link to that page should be correlated with its topic. [sent-125, score-0.492]

70 The relationship between these heterogeneous features may be complex, with some features more relevant to the classiﬁcation task than others. [sent-126, score-0.153]

71 Let Jc (Q; R(k) ) denote the regularizer from the k th feature representation. [sent-127, score-0.153]

72 The result is a regularizer with regions K = ∪k R(k) and adjusted a priori weights αk Pk (R) over the regions. [sent-129, score-0.357]

73 This gives rise to the following regularization problem: $\max_{\alpha_k \ge 0,\ \sum_k \alpha_k = 1}\ \min_{Q(y|x)} J(Q; \lambda, \alpha)$ (14), where J(Q; λ, α) is the overall objective that uses Jc(Q; K, α) as the regularizer. [sent-133, score-0.331]

74 The maximum is well-deﬁned since the objective is concave in {αk }. [sent-134, score-0.087]

75 7 Experiments We ﬁrst illustrate the performance of information regularization on two generated binary classiﬁcation tasks in the plane. [sent-138, score-0.296]

76 Here we can derive a region covering from the Euclidean metric as spheres of a certain radius centered at each data point. [sent-139, score-0.263]
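
A Euclidean region covering of the kind described in sentence 76 might be built as follows. The extract does not say how the region weights were chosen, so the uniform P(R) and the uniform P(x|R) inside each ball are assumptions of this sketch, as is the function name `ball_regions`.

```python
import numpy as np

def ball_regions(X, radius):
    """One region per data point: all points within `radius` of that center.

    X: (n, d) array of points. Returns P_R (n,) and P_x_given_R (n, n).
    Uniform region priors and uniform within-region weights are assumed.
    """
    # pairwise Euclidean distances, shape (n, n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    member = dist <= radius                   # row r marks the members of region r
    # each ball contains its own center, so the row sums are at least 1
    P_x_given_R = member / member.sum(axis=1, keepdims=True)
    P_R = np.full(len(X), 1.0 / len(X))
    return P_R, P_x_given_R
```

The radius controls how far labels can spread: balls that overlap link their points into one connected component, while well-separated clusters stay disconnected.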

77 On the data set in Figure 2, inspired by [3], the method correctly propagates the labels to the clusters starting from a single labeled point in each class. [sent-140, score-0.156]

78 Figure 3: Ability of information regularization to correct the output of a prior classifier (left: before, right: after). [sent-176, score-0.372]

79 In the example in Figure 3 we demonstrate that information regularization can be used as a post-processing step for supervised classification, improving error rates by taking advantage of the topology of the space. [sent-177, score-0.299]

80 All points are a priori labeled by a linear classiﬁer that is non-optimal and places a decision boundary through the negative and positive clusters. [sent-178, score-0.32]

81 Information regularization (on a Euclidean region covering deﬁned as circles around each data point) is able to correct the mislabeling of the clusters. [sent-179, score-0.549]

82 Next we test the algorithm on a web document classiﬁcation task, the WebKB data set of [1]. [sent-180, score-0.18]

83 The dataset is interesting because apart from the documents contents we have information about the link structure of the documents. [sent-184, score-0.271]

84 The two sources of information can illustrate the capability of information regularization of combining heterogeneous unlabeled representations. [sent-185, score-0.491]

85 To obtain ’link’ features we collect text that appears under all links that link to that page from other pages, and produce its bag-of-words representation. [sent-187, score-0.385]

86 We employ no stemming, or stop-word processing, but restrict the vocabulary to 2000 text words and 500 link words. [sent-188, score-0.313]

87 We report a naïve Bayes baseline based on the model that features of different words are independent given the document class. [sent-191, score-0.222]

88 The naïve Bayes algorithm can be run on text features, link features, or combine the two feature sets by assuming independence. [sent-192, score-0.273]

89 The key issue in applying information regularization is the derivation of a sound region covering R. [sent-197, score-0.518]

90 For document classiﬁcation we obtained the best results by grouping all documents that share a certain word into the same region; thus each region is in fact a word, and there are as many regions as the size of the vocabulary. [sent-198, score-0.687]

91 Table 1: Web page classification comparison between naïve Bayes, information regularization, and semi-supervised naïve Bayes+EM on text, link, and joint features (columns: text, link, both). [sent-202, score-0.553]

92 When running information regularization with both text and link features we combined the coverings with a weight of 0. [sent-211, score-0.365]

93 We observe that information regularization performs better than naïve Bayes on all types of features, that combining text and link features improves performance of the regularization method, and that on link features the method performs better than the semi-supervised naïve Bayes+EM. [sent-214, score-1.163]

94 Most likely the results do not reflect the full potential of information regularization due to the ad-hoc choice of regions based on the vocabulary used by naïve Bayes. [sent-215, score-0.458]

95 8 Discussion. The regularization principle introduced here provides a general information theoretic approach to exploiting unlabeled points. [sent-216, score-0.497]

96 The solution implied by the principle is unique and can be found efﬁciently with distributed algorithms, performing complementary averages, on the graph induced by the regions. [sent-217, score-0.218]

97 The propagation algorithms also extend to more structured settings. [sent-218, score-0.194]

98 Our preliminary theoretical analysis concerning the number of possible labelings with bounded regularizer is suggestive but rather loose (tighter results can be found). [sent-219, score-0.327]

99 The effect of the choice of the regions (sets of points that ought to be labeled the same) is critical in practice but not yet well-understood. [sent-220, score-0.372]

100 Text classiﬁcation from labeled and unlabeled documents using EM. [sent-273, score-0.385]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qr', 0.369), ('jc', 0.351), ('regularization', 0.255), ('na', 0.205), ('region', 0.187), ('labelings', 0.174), ('link', 0.166), ('regions', 0.165), ('unlabeled', 0.163), ('regularizer', 0.153), ('sl', 0.146), ('conditionals', 0.135), ('document', 0.133), ('labels', 0.124), ('mainly', 0.122), ('ir', 0.119), ('labeled', 0.117), ('documents', 0.105), ('ve', 0.101), ('communication', 0.099), ('yj', 0.097), ('points', 0.09), ('maxr', 0.088), ('qij', 0.088), ('page', 0.087), ('structured', 0.085), ('bayes', 0.082), ('gi', 0.081), ('covering', 0.076), ('text', 0.076), ('yi', 0.076), ('receiver', 0.075), ('propagation', 0.071), ('larization', 0.067), ('sender', 0.067), ('word', 0.067), ('np', 0.064), ('gr', 0.064), ('eq', 0.064), ('classi', 0.063), ('complementary', 0.062), ('zx', 0.059), ('webkb', 0.059), ('features', 0.056), ('yx', 0.053), ('corduneanu', 0.053), ('tommi', 0.053), ('csail', 0.05), ('arithmetic', 0.05), ('szummer', 0.05), ('subsequently', 0.05), ('unique', 0.048), ('upper', 0.048), ('web', 0.047), ('objective', 0.046), ('connected', 0.045), ('topology', 0.044), ('bound', 0.044), ('qi', 0.043), ('cation', 0.043), ('principle', 0.041), ('tree', 0.041), ('criterion', 0.041), ('concave', 0.041), ('heterogeneous', 0.041), ('elementary', 0.041), ('binary', 0.041), ('mutual', 0.041), ('scheme', 0.041), ('pk', 0.04), ('qt', 0.04), ('preserving', 0.04), ('optimizing', 0.039), ('positive', 0.039), ('priori', 0.039), ('theoretic', 0.038), ('vocabulary', 0.038), ('extend', 0.038), ('membership', 0.037), ('yk', 0.036), ('negative', 0.035), ('geometric', 0.034), ('updates', 0.034), ('solution', 0.034), ('distributed', 0.033), ('words', 0.033), ('log', 0.032), ('unconstrained', 0.032), ('combining', 0.032), ('clusters', 0.032), ('circles', 0.031), ('xn', 0.031), ('combine', 0.031), ('em', 0.03), ('spanned', 0.03), ('share', 0.03), ('overall', 0.03), ('frequency', 0.03), ('rewriting', 0.029), ('biasing', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

2 0.19256802 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces.

3 0.12733054 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie

Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.

4 0.12417467 42 nips-2004-Computing regularization paths for learning multiple kernels

Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan

Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.

5 0.11569349 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels.

6 0.10747682 165 nips-2004-Semi-supervised Learning on Directed Graphs

7 0.10583965 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters

8 0.10161845 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

9 0.099513143 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

10 0.098123752 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

11 0.097744592 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

12 0.094578147 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

13 0.094265677 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

14 0.088540874 115 nips-2004-Maximum Margin Clustering

15 0.087201558 82 nips-2004-Incremental Algorithms for Hierarchical Classification

16 0.08676134 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

17 0.082228906 166 nips-2004-Semi-supervised Learning via Gaussian Processes

18 0.078972876 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

19 0.077958241 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

20 0.077000812 136 nips-2004-On Semi-Supervised Classification

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.257), (1, 0.109), (2, -0.041), (3, 0.08), (4, 0.002), (5, 0.092), (6, 0.01), (7, 0.102), (8, 0.032), (9, -0.056), (10, 0.077), (11, 0.066), (12, -0.01), (13, 0.028), (14, -0.077), (15, 0.045), (16, 0.167), (17, -0.157), (18, 0.063), (19, -0.012), (20, 0.105), (21, -0.033), (22, 0.188), (23, 0.033), (24, -0.049), (25, 0.041), (26, -0.093), (27, -0.085), (28, -0.049), (29, 0.015), (30, 0.084), (31, 0.043), (32, 0.055), (33, 0.152), (34, 0.07), (35, -0.049), (36, 0.092), (37, 0.01), (38, -0.002), (39, -0.122), (40, 0.13), (41, 0.141), (42, -0.044), (43, 0.004), (44, -0.007), (45, -0.043), (46, -0.072), (47, -0.027), (48, -0.046), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95958769 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

2 0.65398192 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

3 0.63646322 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie

4 0.5365063 165 nips-2004-Semi-supervised Learning on Directed Graphs

Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf

Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach.

5 0.521819 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

6 0.49485394 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters

7 0.488098 166 nips-2004-Semi-supervised Learning via Gaussian Processes

8 0.4739638 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

9 0.46955633 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

10 0.45645317 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

11 0.43747798 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

12 0.42711565 42 nips-2004-Computing regularization paths for learning multiple kernels

13 0.41695213 145 nips-2004-Parametric Embedding for Class Visualization

14 0.41555515 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

15 0.40653488 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

16 0.40610096 109 nips-2004-Mass Meta-analysis in Talairach Space

17 0.40553519 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

18 0.39782164 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

19 0.37703836 136 nips-2004-On Semi-Supervised Classification

20 0.37587297 127 nips-2004-Neighbourhood Components Analysis

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.1), (15, 0.126), (26, 0.053), (31, 0.052), (33, 0.244), (35, 0.017), (39, 0.018), (50, 0.028), (77, 0.012), (87, 0.277)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90431035 17 nips-2004-Adaptive Manifold Learning

Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha

Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets.

2 0.90278459 100 nips-2004-Learning Preferences for Multiclass Problems

Author: Fabio Aiolli, Alessandro Sperduti

Abstract: Many interesting multiclass problems can be cast in the general framework of label ranking defined on a given set of classes. The evaluation for such a ranking is generally given in terms of the number of violated order constraints between classes. In this paper, we propose the Preference Learning Model as a unifying framework to model and solve a large class of multiclass problems in a large margin perspective. In addition, an original kernel-based method is proposed and evaluated on a ranking dataset with state-of-the-art results.

3 0.9001472 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

Author: Hyun J. Park, Te W. Lee

Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1 Introduction. Unsupervised learning has become an important tool for understanding biological information processing and building intelligent signal processing methods. Real biological systems however are much more robust and flexible than current artificial intelligence, mostly due to the much more efficient representations used in biological systems. Therefore, unsupervised learning algorithms that capture more sophisticated representations can provide a better understanding of neural information processing and also provide improved algorithms for signal processing applications. For example, independent component analysis (ICA) can learn representations similar to simple cell receptive fields in visual cortex [1] and is also applied for feature extraction, image segmentation and denoising [2,3]. ICA can approximate statistics of natural image patches by Eq.(1,2), where X is the data and u is a source signal whose distribution is a product of sparse distributions like a generalized Laplacian distribution: $X = Au$ (1), $P(u) = \prod_i P(u_i)$ (2). But the representation learned by the ICA algorithm is relatively low-level. In biological systems there are more high-level representations such as contours, textures and objects, which are not well represented by the linear ICA model. 
ICA learns only linear dependency between pixels by finding strongly correlated linear axis. Therefore, the modeling capability of ICA is quite limited. Previous approaches showed that one can learn more sophisticated high-level representations by capturing nonlinear dependencies in a post-processing step after the ICA step [4,5,6,7,8]. The focus of these efforts has centered on variance correlation in natural images. After ICA, a source signal is not linearly predictable from others. However, given variance dependencies, a source signal is still ‘predictable’ in a nonlinear manner. It is not possible to de-correlate this variance dependency using a linear transformation. Several researchers have proposed extensions to capture the nonlinear dependencies. Portilla et al. used Gaussian Scale Mixture (GSM) to model variance dependency in wavelet domain. This model can learn variance correlation in source prior and showed improvement in image denoising [4]. But in this model, dependency is defined only between a subset of wavelet coefficients. Hyvarinen and Hoyer suggested using a special variance related distribution to model the variance correlated source prior. This model can learn grouping of dependent sources (Subspace ICA) or topographic arrangements of correlated sources (Topographic ICA) [5,6]. Similarly, Welling et al. suggested a product of expert model where each expert represents a variance correlated group [7]. The product form of the model enables applications to image denoising. But these models don’t reveal higher-order structures explicitly. Our model is motivated by Lewicki and Karklin who proposed a 2-stage model where the 1st stage is an ICA model (Eq. (3)) and the 2 nd-stage is a linear generative model where another source v generates logarithmic variance for the 1st stage (Eq. (4)) [8]. 
This model captures variance dependency structure explicitly, but treating variance as an additional random variable introduces another level of complexity and requires several approximations. Thus, it is difficult to obtain a simple analytic PDF of the source signal u and to apply the model to image processing problems.

$$P(u \mid \lambda) = c \exp\left(-\left|\frac{u}{\lambda}\right|^{q}\right) \tag{3}$$
$$\log[\lambda] = Bv \tag{4}$$

We propose a hierarchical model based on ICA and a mixture of Laplacian distributions. Our model can be considered a simplification of the model in [8], obtained by constraining v to be a 0/1 random vector in which only one element can be 1. Our model is computationally simpler but can still capture variance dependency. Experiments show that our model can reveal higher-order structures similar to [8]. In addition, our model provides a simple parametric PDF of variance-correlated priors, which is an important advantage for adaptive signal processing. Utilizing this, we demonstrate simple applications to image segmentation and image denoising. Our model provides an improved statistical model for natural images and can be used for other applications including feature extraction, image coding, or learning even higher-order structures.

2 Modeling nonlinear dependencies

We propose a hierarchical or 2-stage model where the 1st stage is an ICA source signal model and the 2nd stage is a mixture model with different variances (figure 1). In natural images, the correlation of variance reflects different types of regularities in the real world. Such specialized regularities can be summarized as “context” information. To model the context-dependent variance correlation, we use mixture models where Laplacian distributions with different variances represent different contexts. For each image patch, a context variable Z “selects” which Laplacian distribution will represent the ICA source signal u. The Laplacian distributions have 0-mean but different variances.
The advantage of the Laplacian distribution for modeling context is that we can model a sparse distribution using only one Laplacian distribution, whereas we would need more than two Gaussian distributions to do the same thing. Also, conventional ICA is a special case of our model with one Laplacian. We define the mixture model and its learning algorithm in the next sections.

Figure 1: Proposed hierarchical model (1st stage is ICA generative model. 2nd stage is mixture of “context dependent” Laplacian distributions which model U. Z is a random variable that selects a Laplacian distribution that generates the given image patch)

2.1 Mixture of Laplacian Distribution

We define the PDF of a mixture of M-dimensional Laplacian distributions as Eq. (5), where N is the number of data samples and K is the number of mixtures.

$$P(U \mid \Lambda, \Pi) = \prod_{n}^{N} P(\vec{u}_n \mid \Lambda, \Pi) = \prod_{n}^{N} \sum_{k}^{K} \pi_k P(\vec{u}_n \mid \vec{\lambda}_k) = \prod_{n}^{N} \sum_{k}^{K} \pi_k \prod_{m}^{M} \frac{1}{2\lambda_{k,m}} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}}\right) \tag{5}$$

$\vec{u}_n = (u_{n,1}, u_{n,2}, \ldots, u_{n,M})$: n-th data sample, $U = (\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_N)$
$\vec{\lambda}_k = (\lambda_{k,1}, \lambda_{k,2}, \ldots, \lambda_{k,M})$: variance of k-th Laplacian distribution, $\Lambda = (\vec{\lambda}_1, \vec{\lambda}_2, \ldots, \vec{\lambda}_K)$
$\pi_k$: probability of Laplacian distribution k, $\Pi = (\pi_1, \ldots, \pi_K)$ and $\sum_k \pi_k = 1$

It is not easy to maximize Eq. (5) directly, so we use the EM (expectation maximization) algorithm for parameter estimation. Here we introduce a new hidden context variable Z that represents which Laplacian k is responsible for a given data point. Assuming we know the hidden variable Z, we can write the likelihood of the data and Z as Eq. (6),

$$P(U, Z \mid \Lambda, \Pi) = \prod_{n}^{N} P(\vec{u}_n, Z \mid \Lambda, \Pi) = \prod_{n}^{N} \prod_{k}^{K} \left[ (\pi_k)^{z_k^n} \prod_{m}^{M} \left(\frac{1}{2\lambda_{k,m}}\right)^{z_k^n} \exp\left(-z_k^n \frac{|u_{n,m}|}{\lambda_{k,m}}\right) \right] \tag{6}$$

$z_k^n$: hidden binary random variable, 1 if the n-th data sample is generated from the k-th Laplacian, 0 otherwise. ($Z = (z_k^n)$ and $\sum_k z_k^n = 1$ for all n = 1…N)

2.2 EM algorithm for learning the mixture model

The EM algorithm maximizes the log likelihood of the data averaged over the hidden variable Z. The log likelihood and its expectation can be computed as in Eq. (7,8).

$$\log P(U, Z \mid \Lambda, \Pi) = \sum_{n,k} \left[ z_k^n \log(\pi_k) + \sum_{m} z_k^n \left( \log\frac{1}{2\lambda_{k,m}} - \frac{|u_{n,m}|}{\lambda_{k,m}} \right) \right] \tag{7}$$

$$E\{\log P(U, Z \mid \Lambda, \Pi)\} = \sum_{n,k} E\{z_k^n\} \left[ \log(\pi_k) + \sum_{m} \left( \log\frac{1}{2\lambda_{k,m}} - \frac{|u_{n,m}|}{\lambda_{k,m}} \right) \right] \tag{8}$$

The expectation in Eq. (8) can be evaluated if we are given the data U and estimated parameters Λ and Π. For Λ and Π, the EM algorithm uses the current estimates Λ’ and Π’.

$$E\{z_k^n\} \equiv E\{z_k^n \mid U, \Lambda', \Pi'\} = \sum_{z_k^n = 0}^{1} z_k^n P(z_k^n \mid u_n, \Lambda', \Pi') = P(z_k^n = 1 \mid u_n, \Lambda', \Pi')$$
$$= \frac{P(u_n \mid z_k^n = 1, \Lambda', \Pi')\, P(z_k^n = 1 \mid \Lambda', \Pi')}{P(u_n \mid \Lambda', \Pi')} = \frac{1}{c_n}\, \pi_k' \prod_{m}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right) \tag{9}$$

where the normalization constant can be computed as

$$c_n = P(u_n \mid \Lambda', \Pi') = \sum_{k}^{K} P(u_n \mid z_k^n, \Lambda', \Pi')\, P(z_k^n \mid \Lambda', \Pi') = \sum_{k=1}^{K} \pi_k' \prod_{m=1}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right) \tag{10}$$

The EM algorithm works by maximizing Eq. (8), given the expectation computed from Eq. (9,10). Eq. (9,10) can be computed using the Λ’ and Π’ estimated in the previous iteration of the EM algorithm. This is the E-step of the EM algorithm. Then, in the M-step, we need to maximize Eq. (8) over the parameters Λ and Π. First, we can maximize Eq. (8) with respect to Λ by setting the derivative to 0.

$$\frac{\partial E\{\log P(U, Z \mid \Lambda, \Pi)\}}{\partial \lambda_{k,m}} = \sum_{n} E\{z_k^n\} \left( -\frac{1}{\lambda_{k,m}} + \frac{|u_{n,m}|}{(\lambda_{k,m})^2} \right) = 0 \;\Rightarrow\; \lambda_{k,m} = \frac{\sum_n E\{z_k^n\} \cdot |u_{n,m}|}{\sum_n E\{z_k^n\}} \tag{11}$$

Second, for maximization of Eq. (8) with respect to Π, we can rewrite Eq. (8) as below, where C collects the terms that do not depend on Π.

$$E\{\log P(U, Z \mid \Lambda, \Pi)\} = C + \sum_{n,k'} E\{z_{k'}^n\} \log(\pi_{k'}) \tag{12}$$

The derivative of Eq. (12) with respect to Π cannot be 0, because every term grows with its $\pi_{k'}$; an unconstrained maximization would therefore ignore the requirement $\sum_k \pi_k = 1$.
Instead, we need to use the Lagrange multiplier method for maximization. A Lagrange function can be defined as Eq. (13), where ρ is a Lagrange multiplier.

$$L(\Pi, \rho) = -\sum_{n,k'} E\{z_{k'}^n\} \log(\pi_{k'}) + \rho \left( \sum_{k'} \pi_{k'} - 1 \right) \tag{13}$$

By setting the derivatives of Eq. (13) with respect to ρ and Π to 0, we obtain the maximizing Π. We just show the solution in Eq. (14).

$$\frac{\partial L(\Pi, \rho)}{\partial \Pi} = 0,\quad \frac{\partial L(\Pi, \rho)}{\partial \rho} = 0 \;\Rightarrow\; \pi_k = \frac{\sum_n E\{z_k^n\}}{\sum_k \sum_n E\{z_k^n\}} \tag{14}$$

The EM algorithm can then be summarized as in figure 2. As the convergence criterion, we can use the expectation of the log likelihood, which can be calculated from Eq. (8).

1. Initialize $\pi_k = \frac{1}{K}$, $\lambda_{k,m} = E\{|u_m|\} + e$ (e is small random noise)
2. Calculate the expectation by $E\{z_k^n\} \equiv E\{z_k^n \mid U, \Lambda', \Pi'\} = \frac{1}{c_n}\, \pi_k' \prod_{m}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$
3. Maximize the log likelihood given the expectation: $\lambda_{k,m} \leftarrow \left(\sum_n E\{z_k^n\} \cdot |u_{n,m}|\right) / \left(\sum_n E\{z_k^n\}\right)$, $\pi_k \leftarrow \left(\sum_n E\{z_k^n\}\right) / \left(\sum_k \sum_n E\{z_k^n\}\right)$
4. If (converged) stop, otherwise repeat from step 2.

Figure 2: Outline of EM algorithm for Learning the Mixture Model

3 Experimental Results

Here we provide examples of image data and show how the learning procedure is performed for the mixture model. We also provide a visualization of the learned variances, which reveals the structure of variance correlation, and an application to image denoising.

3.1 Learning Nonlinear Dependencies in Natural images

As shown in figure 1, the 1st stage of the proposed model is simply the linear ICA. The ICA matrices A and W (= $A^{-1}$) are learned by the FastICA algorithm [9]. We sampled $10^5$ (=N) data from 16x16 patches (256 dim.) of natural images and use them for both first and second stage learning. The ICA input dimension is 256, and the source dimension is set to 160 (=M). The learned ICA basis is partially shown in figure 1. The 2nd stage mixture model is learned given the ICA source signals.
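The EM procedure outlined in figure 2 can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`log_laplace`, `e_step`, `m_step`, `fit`) are hypothetical, the initial perturbation is multiplicative rather than the additive noise e of step 1, and convergence is tracked with the data log likelihood $\sum_n \log c_n$ instead of the expectation in Eq. (8).

```python
import math
import random


def log_laplace(u, lam):
    """Log of prod_m 1/(2*lam_m) * exp(-|u_m|/lam_m) for one sample u."""
    return sum(-math.log(2.0 * l) - abs(x) / l for x, l in zip(u, lam))


def e_step(data, pis, lams):
    """Return E{z_k^n} for every sample (Eq. 9) and sum_n log c_n (Eq. 10)."""
    resp, loglik = [], 0.0
    for u in data:
        logs = [math.log(p) + log_laplace(u, lam) for p, lam in zip(pis, lams)]
        top = max(logs)  # log-sum-exp keeps the products of Eq. (10) from underflowing
        weights = [math.exp(v - top) for v in logs]
        c = sum(weights)
        resp.append([w / c for w in weights])
        loglik += top + math.log(c)
    return resp, loglik


def m_step(data, resp):
    """Closed-form updates: Eq. (11) for lambda and Eq. (14) for pi."""
    n_mix, n_dim = len(resp[0]), len(data[0])
    pis, lams = [], []
    for k in range(n_mix):
        mass = max(sum(r[k] for r in resp), 1e-12)  # guard against an empty component
        lams.append([max(sum(r[k] * abs(u[m]) for r, u in zip(resp, data)) / mass, 1e-8)
                     for m in range(n_dim)])
        pis.append(mass / len(data))
    return pis, lams


def fit(data, n_mix, n_iter=50, seed=0):
    """Run EM and return (pis, lams, log likelihood recorded at each iteration)."""
    rng = random.Random(seed)
    n_dim = len(data[0])
    mean_abs = [sum(abs(u[m]) for u in data) / len(data) for m in range(n_dim)]
    pis = [1.0 / n_mix] * n_mix
    # Assumption: multiplicative jitter stands in for the "small random noise" e.
    lams = [[a * (1.0 + 0.1 * rng.random()) for a in mean_abs] for _ in range(n_mix)]
    history = []
    for _ in range(n_iter):
        resp, loglik = e_step(data, pis, lams)
        history.append(loglik)
        pis, lams = m_step(data, resp)
    return pis, lams, history
```

Given the responsibilities returned by `e_step`, the index of the largest entry for a sample would play the role of its most explaining Laplacian, or “context”. For the data sizes quoted above (N = $10^5$, M = 160, K up to 256) a vectorized implementation would be needed; the pure-Python loops here are only meant to make Eq. (9), (11) and (14) concrete.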
In the 2nd stage the number of mixtures is set to 16, 64, or 256 (=K). Training by the EM algorithm is fast, and several hundred iterations are sufficient for convergence (0.5 hour on a 1.7GHz Pentium PC). For the visualization of the learned variance, we adapted the visualization method from [8]. Each dimension of the ICA source signal corresponds to an ICA basis (a column of A), and each ICA basis is localized in both image and frequency space. Then, for each Laplacian distribution, we can display its variance vector as a set of points in image and frequency space. Each point can be color coded by variance value as in figure 3.

Figure 3: Visualization of learned variances ((a1) and (a2) visualize the variance of Laplacian #4, and (b1) and (b2) show that of Laplacian #5. High variance value is mapped to red color and low variance is mapped to blue. In Laplacian #4, variances for diagonally oriented edges are high. But in Laplacian #5, variances for edges at spatially right position are high. Variance structures are related to “contexts” in the image. For example, Laplacian #4 explains image patches that have oriented textures or edges. Laplacian #5 captures patches where the left side of the patch is clean but the right side is filled with randomly oriented edges.)

A key idea of our model is that we can mix independent distributions to obtain a nonlinearly dependent distribution. This modeling power is illustrated in figure 4.

Figure 4: Joint distribution of nonlinearly dependent sources. ((a) is a joint histogram of 2 ICA sources, (b) is computed from the learned mixture model, and (c) is from the learned Laplacian model. In (a), the variance of u2 is smaller than that of u1 in the center area (arrow A), but almost equal to that of u1 outside (arrow B). So the variance of u2 is dependent on u1. This nonlinear dependency is closely approximated by the mixture model in (b), but not in (c).)
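The claim that mixing independent distributions yields a nonlinearly dependent one can be checked on toy data. The following is an illustrative sketch with hypothetical names and made-up scales (a "quiet" context with scales 0.2 and an "active" context with scales 2.0), not the learned model of figure 4: within each context the two coordinates are independent Laplacians, yet across the mixture their magnitudes are correlated while the signals themselves stay linearly uncorrelated.

```python
import random


def sample_mixture(n, scale_sets, rng):
    """Draw n two-dimensional points. Each point first picks a context uniformly
    at random, then draws independent zero-mean Laplacian coordinates whose
    scales are given by that context."""
    points = []
    for _ in range(n):
        scales = rng.choice(scale_sets)
        points.append([rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / s) for s in scales])
    return points


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
points = sample_mixture(20000, [(0.2, 0.2), (2.0, 2.0)], rng)
u1 = [p[0] for p in points]
u2 = [p[1] for p in points]
linear = correlation(u1, u2)                                           # close to zero
magnitude = correlation([abs(x) for x in u1], [abs(x) for x in u2])    # clearly positive
```

A single Laplacian per coordinate cannot reproduce the positive magnitude correlation, which is the qualitative difference between panels (b) and (c) of figure 4.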
3.2 Unsupervised Image Segmentation

The idea behind our model is that the image can be modeled as a mixture of different variance-correlated “contexts”. We show how the learned model can be used to classify different contexts in an unsupervised image segmentation task. Given the learned model and data, we can compute the expectation of the hidden variable Z from Eq. (9). Then, for an image patch, we can select the Laplacian distribution with the highest probability, which is the most explaining Laplacian or “context”. For segmentation, we use the model with 16 Laplacians. This enables an abstract partitioning of images, and we can visualize the organization of images more clearly (figure 5).

Figure 5: Unsupervised image segmentation (left is original image, middle is color labeled image, right image shows color coded Laplacians with variance structure. Each color corresponds to a Laplacian distribution, which represents surface or textural organization of underlying contexts. Laplacian #14 captures smooth surface and Laplacian #9 captures contrast between clear sky and textured ground scenes.)

3.3 Application to Image Restoration

The proposed mixture model provides a better parametric model of the ICA source distribution and hence an improved model of the image structure. One advantage lies in the MAP (maximum a posteriori) estimation of a noisy image. If we assume Gaussian noise n, the image generation model can be written as Eq. (15). Then, we can compute the MAP estimate of the ICA source signal u by Eq. (16) and reconstruct the original image.

$$X = Au + n \tag{15}$$
$$\hat{u} = \arg\max_{u} \log P(u \mid X, A) = \arg\max_{u} \left( \log P(X \mid u, A) + \log P(u) \right) \tag{16}$$

Since we assumed Gaussian noise, P(X|u,A) in Eq. (16) is Gaussian. P(u) in Eq. (16) can be modeled as a Laplacian or as a mixture of Laplacian distributions. The mixture distribution can be approximated by the maximum explaining Laplacian.
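To make Eq. (16) concrete, consider the special case in which A is square and orthonormal. This is an assumption made only for this sketch (the experiments above use 256 inputs and 160 sources), as are the function names. With Gaussian noise of variance `noise_var` and a Laplacian prior with scales `lam`, the objective decouples per source into $(y_m - u_m)^2 / 2 + (\sigma^2 / \lambda_m)|u_m|$ with $y = A^T X$, whose minimizer is the soft-threshold rule below. For the mixture prior, `lam` would be the variance vector of the selected maximum explaining Laplacian; how that Laplacian is chosen from a noisy patch is not specified here.

```python
def soft_threshold(y, threshold):
    """Minimizer over u of (y - u)^2 / 2 + threshold * |u|."""
    if y > threshold:
        return y - threshold
    if y < -threshold:
        return y + threshold
    return 0.0


def map_sources(x, basis, lam, noise_var):
    """MAP estimate of u in x = A u + n, assuming A is orthonormal.

    basis[m] holds the m-th column of A, lam[m] is the Laplacian scale of
    source m, and noise_var is the variance of the Gaussian noise."""
    y = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]  # y = A^T x
    return [soft_threshold(y_m, noise_var / l_m) for y_m, l_m in zip(y, lam)]


def reconstruct(u, basis):
    """Restored image x_hat = A u."""
    dim = len(basis[0])
    return [sum(u_m * b[i] for u_m, b in zip(u, basis)) for i in range(dim)]
```

Sources with a small scale in the selected context are shrunk more aggressively than sources with a large scale, which is how a context-dependent prior can change the restoration result relative to a single Laplacian prior.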
We evaluated 3 different methods for image restoration including ICA MAP estimation with simple Laplacian prior, same with Laplacian mixture prior, and the Wiener filter. Figure 6 shows an example and figure 7 summarizes the results obtained with different noise levels. As shown, MAP estimation with the mixture prior performs better than the others in terms of SNR and SSIM (Structural Similarity Measure) [10].

Figure 6: Image restoration results (signal variance 1.0, noise variance 0.81)

[Figure 7 plots: SNR and SSIM index versus noise variance (0 to 2.5); curves: ICA MAP (Mixture prior), ICA MAP (Laplacian prior), Wiener, Noisy Image]

Figure 7: SNR and SSIM for 3 different algorithms (signal variance = 1.0)

4 Discussion

We proposed a mixture model to learn nonlinear dependencies of ICA source signals for natural images. The proposed mixture of Laplacian distribution model is a generalization of the conventional independent source priors and can model variance dependency given natural image signals. Experiments show that the proposed model can learn the variance correlated signals grouped as different mixtures and learn high-level structures, which are highly correlated with the underlying physical properties captured in the image. Our model provides an analytic prior of nearly independent and variance-correlated signals, which was not viable in previous models [4,5,6,7,8]. The learned variances of the mixture model show structured localization in image and frequency space, which are similar to the result in [8]. Since the model is given no information about the spatial location or frequency of the source signals, we can assume that the dependency captured by the mixture model reveals regularity in the natural images. As shown in image labeling experiments, such regularities correspond to specific surface types (textures) or boundaries between surfaces.
The learned mixture model can be used to discover the hidden contexts that generated such regularity or correlated signal groups. Experiments also show that the labeling of image patches is highly correlated with the object surface types shown in the image. The segmentation results show regularity across image space and strong correlation with high-level concepts. Finally, we showed applications of the model to image restoration. We compared the performance with conventional ICA MAP estimation and the Wiener filter. Our results suggest that the proposed model outperforms these traditional methods. This is due to the estimation of the correlated variance structure, which provides an improved prior that has not been considered in other methods. In our future work, we plan to exploit the regularity of the image segmentation result to learn more high-level structures by building additional hierarchies on the current model. Furthermore, the application to image coding seems promising.

References
[1] A. J. Bell and T. J. Sejnowski, The ‘Independent Components’ of Natural Scenes are Edge Filters, Vision Research, 37(23):3327–3338, 1997.
[2] A. Hyvarinen, Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation, Neural Computation, 11(7):1739-1768, 1999.
[3] T. Lee, M. Lewicki, and T. Sejnowski, ICA Mixture Models for unsupervised Classification of non-gaussian classes and automatic context switching in blind separation, PAMI, 22(10), October 2000.
[4] J. Portilla, V. Strela, M. J. Wainwright and E. P. Simoncelli, Image Denoising using Scale Mixtures of Gaussians in the Wavelet Domain, IEEE Trans. On Image Processing, Vol. 12, No. 11, 1338-1351, 2003.
[5] A. Hyvarinen, P. O. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neurocomputing, 1999.
[6] A. Hyvarinen, P. O. Hoyer, Topographic Independent component analysis as a model of V1 Receptive Fields, Neurocomputing, Vol. 38-40, June 2001.
[7] M. Welling, G. E. Hinton, and S. Osindero, Learning Sparse Topographic Representations with Products of Student-t Distributions, NIPS, 2002.
[8] M. S. Lewicki and Y. Karklin, Learning higher-order structures in natural images, Network: Comput. Neural Syst. 14 (August 2003) 483-499.
[9] A. Hyvarinen, P. O. Hoyer, Fast ICA matlab code, http://www.cis.hut.fi/projects/compneuro/extensions.html/
[10] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, The SSIM Index for Image Quality Assessment, IEEE Transactions on Image Processing, vol. 13, no. 4, Apr. 2004.

same-paper 4 0.87132055 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

5 0.7625792 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data.

Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a top-down perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination).
GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyvärinen.4 The mixer variables provide the top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables.

∗ Mixer variables are also called multipliers, but are unrelated to the scales of a wavelet.

Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data.
1 GSM inference of Gaussian and mixer variables

In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn}, by a common mixer variable v.

$$l = vg \tag{1}$$

We assume g are uncorrelated (σ² along the diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution:

$$p[v] \propto [v \exp(-v^2/2)]^a \tag{2}$$

where 0 < a ≤ 1 parameterizes the strength of the prior. For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn, v from the n filter responses l1 . . . ln. It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model:

rv | distribution | posterior mean
$p[v \mid l_1]$ | $\frac{\sqrt{\sigma / |l_1|}}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \exp\left(-\frac{v^2}{2} - \frac{l_1^2}{2 v^2 \sigma^2}\right)$ | $\sqrt{\frac{|l_1|}{\sigma}}\, \frac{B\left(1, \frac{|l_1|}{\sigma}\right)}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$
$p[v \mid l]$ | $\frac{(l/\sigma)^{\frac{1}{2}(n-2)}}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}\, v^{-(n-1)} \exp\left(-\frac{v^2}{2} - \frac{l^2}{2 v^2 \sigma^2}\right)$ | $\sqrt{\frac{l}{\sigma}}\, \frac{B\left(\frac{3-n}{2}, \frac{l}{\sigma}\right)}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}$
$p[|g_1| \mid l_1]$ | $\frac{\sqrt{\sigma |l_1|}}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}\, \frac{1}{g_1^2} \exp\left(-\frac{g_1^2}{2\sigma^2} - \frac{l_1^2}{2 g_1^2}\right)$ | $\sqrt{\sigma |l_1|}\, \frac{B\left(0, \frac{|l_1|}{\sigma}\right)}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$
$p[|g_1| \mid l]$ | $\frac{|l_1|^{(2-n)} (l/\sigma)^{\frac{n}{2}-1}}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}\, g_1^{(n-3)} \exp\left(-\frac{g_1^2}{2\sigma^2} \frac{l^2}{l_1^2} - \frac{l_1^2}{2 g_1^2}\right)$ | $\sqrt{\frac{\sigma}{l}}\, |l_1|\, \frac{B\left(\frac{n}{2} - \frac{1}{2}, \frac{l}{\sigma}\right)}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}$

where B(n, x) is the modified Bessel function of the second kind (see also24 ), $l = \sqrt{\sum_i l_i^2}$ and gi is forced to have the same sign as li, since the mixer variables are always positive.
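The local posterior p[v|l1] for a = 1 can be checked numerically without any special-function library. The sketch below uses hypothetical function names and a plain midpoint rule; it is an illustration, not part of the original method. It integrates the unnormalized posterior $\exp(-v^2/2 - l_1^2/(2v^2\sigma^2))$ and compares the result with the normalizer $\sqrt{|l_1|/\sigma}\, B(\frac{1}{2}, |l_1|/\sigma)$ implied by the first table row, using the closed form $K_{1/2}(x) = \sqrt{\pi/(2x)}\, e^{-x}$ of the half-integer Bessel function, so that the normalizer reduces to $\sqrt{\pi/2}\, e^{-|l_1|/\sigma}$.

```python
import math


def unnormalized_posterior(v, l1, sigma):
    """p[v | l1] up to a constant for the a = 1 Rayleigh prior (requires v > 0)."""
    return math.exp(-0.5 * v * v - l1 * l1 / (2.0 * v * v * sigma * sigma))


def posterior_summary(l1, sigma, v_max=12.0, steps=20000):
    """Midpoint-rule estimates of the normalizer and of the posterior mean of v."""
    dv = v_max / steps
    norm = 0.0
    first_moment = 0.0
    for i in range(steps):
        v = (i + 0.5) * dv  # midpoints never touch v = 0
        w = unnormalized_posterior(v, l1, sigma) * dv
        norm += w
        first_moment += v * w
    return norm, first_moment / norm


def closed_form_normalizer(l1, sigma):
    """sqrt(|l1|/sigma) * K_{1/2}(|l1|/sigma), written out as sqrt(pi/2) * exp(-|l1|/sigma)."""
    return math.sqrt(math.pi / 2.0) * math.exp(-abs(l1) / sigma)
```

The numerically computed posterior mean grows with |l1|, in line with the intuition that a large filter response is partly explained by a large mixer value.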
Note that p[v|l1] and p[g1|l1] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln}. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25

The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarchical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li, it is advantageous to learn the hierarchical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables.

† We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images.

[Figure 1 panels: A generative model (mixers vα and vβ multiply Gaussians g1 ... g40 to give l1 ... l40); B marginal distribution and joint conditional of l1 and l2; C mixer, D Gaussian, E Gaussian joint conditional, each with columns: actual distribution; 1 filter, too local; 20 filters; 40 filters, too global]

Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα, or mixer variable vβ. B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα. Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1. E Joint conditional statistics of the estimates of Gaussian variables g1 and g2.

We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα|{l}) and one of the Gaussian variables E(g1|{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each estimating their respective g1 and g2. Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images.

[Figure 2 panels: A generative model (mixers vα, vβ, vγ multiply Gaussians g1 ... g100 to give l1 ... l100); B actual (top) and inferred (bottom) association probabilities versus filter number (1 to 100) for vα, vβ, vγ; C pixel, Gaussian and mixer distributions and joint conditionals, Gibbs fit versus assumed]

Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, vα, vβ, or vγ, is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα, vβ, and vγ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα, vβ, and vγ. C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling.
Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26

2 Neighborhood inference: solving the assignment problem

The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm. We assume that gi has association probability pij (satisfying $\sum_j p_{ij} = 1, \forall i$) of being assigned to mixer variable vj. The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments:

$$l_i = g_i v_{s_i} \tag{3}$$

Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li, we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij.
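The generative side of the mixture GSM in equation 3 is easy to simulate. The sketch below is illustrative only: the function names are hypothetical, the mixers follow the a = 1 Rayleigh prior rather than the a = .1 setting of figure 2, and `update_assoc` is an assumed count-based form of the M phase (the update rule itself is not spelled out above). It does not implement the Gibbs E phase; in the test of the update, the sampled assignments are the true ones rather than posterior samples.

```python
import math
import random


def sample_rayleigh(rng):
    """Draw from p[v] proportional to v * exp(-v^2 / 2): v^2 / 2 is Exp(1)."""
    return math.sqrt(2.0 * rng.expovariate(1.0))


def sample_patch(assoc, sigma, rng):
    """One patch of filter responses l_i = g_i * v_{s_i} (equation 3).

    assoc[i][j] is the association probability p_ij of filter i with mixer j.
    Returns the responses and the sampled assignments s_i (0-based)."""
    n_mixers = len(assoc[0])
    mixers = [sample_rayleigh(rng) for _ in range(n_mixers)]
    assignments = [rng.choices(range(n_mixers), weights=row)[0] for row in assoc]
    responses = [rng.gauss(0.0, sigma) * mixers[s] for s in assignments]
    return responses, assignments


def update_assoc(assignment_samples, n_mixers):
    """Assumed M phase: p_ij <- fraction of sampled assignments with s_i = j."""
    n_filters = len(assignment_samples[0])
    counts = [[0] * n_mixers for _ in range(n_filters)]
    for assignments in assignment_samples:
        for i, s in enumerate(assignments):
            counts[i][s] += 1
    total = len(assignment_samples)
    return [[c / total for c in row] for row in counts]
```

In the full procedure, the assignment samples passed to `update_assoc` would come from the Gibbs sampler conditioned on the observed filter responses, so the recovered pij reflect the data rather than a known generative model.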
Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the ﬁlter response samples just to those associated with each mixer variable. We tested the ability of this inference method to ﬁnd the associations in the probabilistic mixer variable synthetic example shown in ﬁgure 2, (A,B). The true generative model speciﬁes probabilistic overlap of 3 mixer variables. We generated 5000 samples for each ﬁlter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the ﬁlter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently ﬁnds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefﬁcient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear ﬁlters from a multi-scale oriented steerable pyramid,28 with 100 ﬁlters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. 
The image ensemble consists of 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain), with 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights p_ij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient being generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P(v|patch). The first neighborhood (in B) prefers vertical patterns across most of its "receptive field", while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together the two phases of a quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions [17, 18], as do the maximal patches [19]. Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters [25] and to simulations with a larger number of neighborhoods.

[Figure 3 panel labels: A, Phase 1 and Phase 2 grids with X position and Y position axes from -19 to 19; B, Neighborhood, Example max patches, Average; C, Neighborhood, Average; D, Coefficient, Gaussian, and Mixer distributions, Gibbs fit versus assumed.]

Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with a spacing of 8 pixels between filters. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown are four 38×38 pixel patches that had the maximum log likelihood of P(v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables.

From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent.
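To illustrate why independence of the Gaussian components is the expected signature of a GSM, the sketch below simulates two filter responses that share one mixer variable and compares the correlation of their squared values with that of the underlying Gaussian components. It is a toy simulation under stated assumptions, not the estimation procedure used here: the mixer is drawn from a unit-scale Rayleigh distribution, the true Gaussian components are used rather than posterior estimates such as E(g|l), and the function names are hypothetical.

```python
import numpy as np


def simulate_shared_mixer(n_samples=50000, seed=0):
    """Draw two GSM responses l_i = v * g_i that share a single mixer v."""
    rng = np.random.default_rng(seed)
    v = rng.rayleigh(scale=1.0, size=n_samples)   # assumed unit-scale mixer
    g = rng.standard_normal((n_samples, 2))       # independent Gaussians
    l = v[:, None] * g                            # multiplicative mixture
    return l, g, v


def squared_correlation(x):
    """Pearson correlation between the squares of the two columns of x."""
    return float(np.corrcoef(x[:, 0] ** 2, x[:, 1] ** 2)[0, 1])


l, g, v = simulate_shared_mixer()
corr_l = squared_correlation(l)   # dependence induced by the shared mixer
corr_g = squared_correlation(g)   # Gaussian components stay independent
print(f"squared responses: {corr_l:.3f}, Gaussian components: {corr_g:.3f}")
```

Under these assumptions the population correlation of the squared responses is 0.2, while that of the squared Gaussian components is 0. Natural image data will not match a toy model exactly, which is consistent with the estimates above being described only as roughly independent.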
We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P(v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or with more horizontal stripes (B).

[Figure 4 panel labels: A and B, each with Neighborhood, Average, Example max patches, Top 25 max patches.]

Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P(v|patch). Right: Image and, marked on top of it, example patches that yielded the maximum likelihood of P(v|patch).

4 Discussion

Work on the study of natural image statistics has recently evolved from issues about scale-space hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components.
This includes bottom-up (e.g., bowties, hierarchical representations such as complex cells) and top-down (e.g., GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing.

A most critical aspect of a hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components.

An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells [6]; the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion.

References

[1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.
[2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications.
[4] A Hyvärinen, J Hurri, and J Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003.
[5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999.
[6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997.
[9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994.
[10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.
[11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999.
[12] J Romberg, H Choi, and R Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.
[13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.
[14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, 2000.
[15] H Brehm and W Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.
[16] T Bollerslev, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994.
[17] A Hyvärinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.
[18] P Hoyer and A Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003.
[20] W Laurenz and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] C Kayser, W Einhäuser, O Dümmer, P König, and K P Körding. Extracting slow subspaces from natural videos leads to complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int'l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg.
[22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996.
[23] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[24] U Grenander and A Srivastava. Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002.
[25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int'l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society.
[26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.
[27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

6 0.74590975 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

7 0.74384433 145 nips-2004-Parametric Embedding for Class Visualization

8 0.74321598 77 nips-2004-Hierarchical Clustering of a Mixture Model

9 0.74259663 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

10 0.73397112 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

11 0.72947282 102 nips-2004-Learning first-order Markov models for control

12 0.72880423 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

13 0.72598207 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

14 0.72581267 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

15 0.72563678 44 nips-2004-Conditional Random Fields for Object Recognition

16 0.72550333 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

17 0.72537327 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

18 0.72415084 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

19 0.72367513 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

20 0.72337651 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach