nips nips2004 nips2004-32 knowledge-graph by maker-knowledge-mining

32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers


Source: pdf

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Boosting on manifolds: adaptive regularization of base classifiers Balázs Kégl and Ligen Wang, Department of Computer Science and Operations Research, University of Montreal, CP 6128 succ. [sent-1, score-0.498]

2 Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. [sent-4, score-0.195]

3 On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. [sent-5, score-0.449]

4 On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. [sent-6, score-0.112]

5 Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. [sent-7, score-0.236]

6 The algorithm constructs a weighted linear combination of simple base classifiers in an iterative fashion. [sent-9, score-0.479]

7 One of the remarkable properties of A DA B OOST is that it is relatively immune to overfitting even after the training error has been driven to zero. [sent-10, score-0.117]

8 Rather than focusing on possibly noisy data points, we attempt to achieve regularization by favoring base classifiers that are smooth in a certain sense. [sent-16, score-0.464]

9 AdaBoost will compare base classifiers based solely on their weighted errors, so, implicitly, it will consider every base classifier as having the same (usually low) complexity. [sent-19, score-0.911]

10 On the other hand, intuitively, we may hope to achieve better generalization if we prefer base classifiers that “cut through” sparse regions to base classifiers that cut into “natural” clusters or cut the manifold several times. [sent-20, score-1.045]

11 To formalize this intuition, we use the graph Laplacian regularizer proposed in connection to manifold learning [4] and spectral clustering [5] (Section 3). [sent-21, score-0.178]

12 For binary base classifiers, this penalty is proportional to the number of edges of the neighborhood graph that the classifier cuts (Figure 1(b)). [sent-22, score-0.71]

13 (b) The graph Laplacian penalty is proportional to the number of separated neighbors. [sent-24, score-0.212]
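
A minimal sketch of this separated-neighbor count, assuming an unweighted neighborhood graph given as a list of vertex-index pairs; the function name and the normalization by the total number of edges are illustrative choices, not taken from the paper.

```python
def cut_edge_penalty(h_values, edges):
    """Penalty of a binary (+1/-1) base classifier on the neighborhood graph:
    the fraction of graph edges whose two endpoints receive different labels.
    For +/-1 labels this is proportional to the graph Laplacian penalty h^T L h."""
    cut = sum(1 for i, j in edges if h_values[i] != h_values[j])
    return cut / len(edges)
```

A base classifier that cuts into a dense cluster separates many neighboring pairs and therefore pays a large penalty, while one that cuts through a sparse region pays almost nothing.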

14 To incorporate this adaptive penalization of base classifiers into A DA B OOST, we will turn to the marginal A DA B OOST algorithm [6] also known as arc-gv [7]. [sent-25, score-0.535]

15 This algorithm can be interpreted as A DA B OOST with an L1 weight decay on the base classifier coefficients with a weight decay coefficient θ. [sent-26, score-0.594]

16 The algorithm has been used to maximize the hard margin on the data [7, 6] and also for regularization [3]. [sent-27, score-0.106]

17 The coefficient θ is adaptive in all these applications: in [7] and [6] it depends on the hard margin and the weighted error, respectively, whereas in [3] it is different for every training point and it quantifies the “noisiness” of the points. [sent-28, score-0.121]

18 The idea of this paper is to make θ dependent on the individual base classifiers, in particular, to set θ to the regularization penalty of the base classifier. [sent-29, score-1.04]

19 First, with this choice, the objective of base learning becomes standard regularized error minimization so the proposed algorithm accommodates the boosting of a large family of regularized learning algorithms. [sent-30, score-0.843]

20 Second, the coefficients of the base classifiers are lowered proportionally with their complexity, which can be interpreted as an adaptive weight decay. [sent-31, score-0.478]

21 Experimental results (Section 4) show that the regularized algorithm can improve generalization. [sent-33, score-0.084]

22 Even when the improvement is not significant, the difference between the training error and the test error decreases significantly and the final classifier is much sparser than A DA B OOST’s solution, both of which indicate reduced overfitting. [sent-34, score-0.274]

23 Since the Laplacian penalty can be computed without knowing the labels, the algorithm can also be used for semi-supervised learning. [sent-35, score-0.187]

24 The weights are initialized uniformly in line 1 (Figure 2), and are updated in each iteration in line 10. [sent-45, score-0.086]

25 We suppose that we are given a base learner algorithm BASE$(D_n, w, P(\cdot))$ that, in each iteration $t$, returns a base classifier $h^{(t)}$ coming from a subset of $\mathcal{H} = \{h : \mathbb{R}^d \to \{-1, 1\}\}$. [sent-46, score-0.993]

26 In AdaBoost, the goal of the base classifier is to minimize the weighted error $\epsilon^{(t)}(h) = \sum_{i=1}^n w_i^{(t)} I\{h(x_i) \neq y_i\}$, where the indicator function $I\{A\}$ is 1 if its argument $A$ is true and 0 otherwise. [sent-47, score-0.675]

27 $D_n$ is the training data, BASE is the base learner, $P$ is the penalty functional, $\lambda$ is the penalty coefficient, and $T$ is the number of iterations. [sent-53, score-0.764]

28 which is equivalent to maximizing the edge $\gamma = 1 - 2\epsilon = \sum_{i=1}^n w_i^{(t)} h(x_i) y_i$. [sent-54, score-0.652]

29 The goal of RegBoost's base learner is to minimize the penalized cost $R_1(h) = \epsilon(h) + \lambda P(h) = \tfrac{1}{2} - \tfrac{1}{2}(\gamma - \theta)$, (1) where $P : \mathcal{H} \to \mathbb{R}$ is an arbitrary penalty functional or regularization operator, provided to RegBoost and to the base learner, $\lambda$ is the penalty coefficient, and $\theta = 2\lambda P(h)$ is the edge offset. [sent-55, score-0.854]
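
A small NumPy sketch of these quantities; the helper names are illustrative, and the weights are assumed to be normalized so that they sum to one.

```python
import numpy as np

def weighted_error(pred, y, w):
    """epsilon(h) = sum_i w_i * I{h(x_i) != y_i}."""
    return float(np.sum(w * (pred != y)))

def edge(pred, y, w):
    """gamma = sum_i w_i * h(x_i) * y_i = 1 - 2 * epsilon(h) for +/-1 predictions."""
    return float(np.sum(w * pred * y))

def r1_objective(pred, y, w, penalty_value, lam):
    """R1(h) = epsilon(h) + lambda * P(h) = 1/2 - (gamma - theta)/2,
    with edge offset theta = 2 * lambda * P(h)."""
    return weighted_error(pred, y, w) + lam * penalty_value
```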

30 Intuitively, the edge γ quantifies by how much h is better than a random guess, while the edge offset θ indicates by how much h(t) must be better than a random guess. [sent-56, score-0.185]

31 This means that for complex base classifiers (with large penalties), we require a better base classification than for simple classifiers. [sent-57, score-0.86]

32 The main advantage of R1 is that it has the form of conventional regularized error minimization, so it accommodates the boosting of all learning algorithms that minimize an error functional of this form (e. [sent-58, score-0.429]

33 If computationally possible, the base learner should minimize $R_2(h) = \left(\frac{2(1-\epsilon(h))}{1+\theta}\right)^{\frac{1+\theta}{2}} \left(\frac{2\epsilon(h)}{1-\theta}\right)^{\frac{1-\theta}{2}} = \left(\frac{1+\gamma}{1+\theta}\right)^{\frac{1+\theta}{2}} \left(\frac{1-\gamma}{1-\theta}\right)^{\frac{1-\theta}{2}}$. (2) [sent-62, score-0.505]
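
A direct transcription of the base objective (2) above as a function of the edge γ and the edge offset θ (a sketch; the helper name is illustrative):

```python
def r2(gamma, theta):
    """R2 expressed through the edge gamma and the edge offset theta, eq. (2)."""
    return ((1 + gamma) / (1 + theta)) ** ((1 + theta) / 2) * \
           ((1 - gamma) / (1 - theta)) ** ((1 - theta) / 2)
```

As a sanity check, for θ = 0 this reduces to √(1 − γ²) = 2√(ε(1 − ε)), the familiar AdaBoost normalization factor, consistent with the remark below that RegBoost with θ^(t) ≡ 0 is formally equivalent to AdaBoost.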

34 After computing the edge and the edge offset in lines 4 and 5, the algorithm sets the coefficient $\alpha^{(t)}$ of the base classifier $h^{(t)}$ to $\alpha^{(t)} = \frac{1}{2}\ln\frac{1+\gamma^{(t)}}{1-\gamma^{(t)}} - \frac{1}{2}\ln\frac{1+\theta^{(t)}}{1-\theta^{(t)}}$. (3) [sent-64, score-0.655]

35 In line 11, the algorithm returns the weighted average of the base classifiers $f^{(T)}(\cdot) = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(\cdot)$ and uses the sign of $f^{(T)}(x)$ to classify $x$. [sent-65, score-0.54]
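
Putting the pieces together, the following is a minimal sketch of the iteration described above (uniform weight initialization, regularized base learning, edge and edge offset, the coefficient of eq. (3), and the exponential weight update); the base-learner interface and the stopping test γ^(t) ≤ θ^(t) are illustrative assumptions, not a transcription of Figure 2 of the paper.

```python
import numpy as np

def regboost(X, y, base_learner, penalty, lam, T):
    """Sketch of the RegBoost loop. base_learner(X, y, w, penalty, lam) is assumed
    to return a classifier h with h(X) in {-1, +1}; penalty(h) returns P(h) >= 0."""
    n = len(y)
    w = np.full(n, 1.0 / n)                         # uniform initial weights
    alphas, classifiers = [], []
    for t in range(T):
        h = base_learner(X, y, w, penalty, lam)
        margins = h(X) * y
        gamma = float(np.sum(w * margins))          # edge
        theta = 2.0 * lam * penalty(h)              # edge offset
        if gamma <= theta:                          # no admissible base classifier left
            break
        alpha = (0.5 * np.log((1 + gamma) / (1 - gamma))
                 - 0.5 * np.log((1 + theta) / (1 - theta)))   # eq. (3)
        alphas.append(alpha)
        classifiers.append(h)
        w = w * np.exp(-alpha * margins)            # weight update
        w = w / w.sum()
    def f(X_new):
        return sum(a * h(X_new) for a, h in zip(alphas, classifiers))
    return f                                        # classify with the sign of f(x)
```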

36 4 In this case, the algorithm returns the actual combined classifier in line 8. [sent-67, score-0.108]

37 This means that either the capacity of the set of base classifiers is too small (γ (t) is small), or the penalty is too high (θ (t) is high), so we cannot find a new base classifier that would improve the combined classifier. [sent-68, score-1.051]

38 Note that the algorithm is formally equivalent to A DA B OOST if θ(t) ≡ 0 and to marginal A DA B OOST if θ (t) ≡ θ is constant. [sent-69, score-0.077]

39 For the analysis of the algorithm, we first define the unnormalized margin achieved by $f^{(T)}$ on $(x_i, y_i)$ as $\tilde\rho_i = f^{(T)}(x_i) y_i$, and the (normalized) margin as $\rho_i = \tilde\rho_i / \|\alpha\|_1 = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(x_i) y_i \big/ \sum_{t=1}^T \alpha^{(t)}$, (4) where $\|\alpha\|_1 = \sum_{t=1}^T \alpha^{(t)}$ is the $L_1$ norm of the coefficient vector. [sent-70, score-0.176]

40 Let the average penalty or margin offset $\bar\theta$ be defined as the average edge offset $\bar\theta = \sum_{t=1}^T \alpha^{(t)} \theta^{(t)} \big/ \sum_{t=1}^T \alpha^{(t)}$. (5) [sent-71, score-0.358]

41 The following theorem upper bounds the marginal training error $L^{(\bar\theta)}(f^{(T)}) = \frac{1}{n}\sum_{i=1}^n I\{\rho_i < \bar\theta\}$ (6) achieved by the combined classifier $f^{(T)}$ that RegBoost outputs. [sent-72, score-0.235]
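
A short sketch of the quantities in (4)-(6); the matrix layout of the base-classifier outputs and the function name are illustrative.

```python
import numpy as np

def marginal_training_error(alphas, thetas, H, y):
    """Normalized margins rho_i (4), average edge offset theta_bar (5), and the
    marginal training error (6). H is a (T, n) array with H[t, i] = h^(t)(x_i)."""
    alphas = np.asarray(alphas, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    norm = alphas.sum()                                   # ||alpha||_1
    rho = (alphas[:, None] * H * y).sum(axis=0) / norm    # eq. (4)
    theta_bar = float(np.dot(alphas, thetas)) / norm      # eq. (5)
    return float(np.mean(rho < theta_bar))                # eq. (6)
```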

42 Let wi be the weight of training point (xi , yi ) after the tth iteration (updated in line 10 in Figure 2), and let α(t) be the weight of the base regressor h(t) (·) (computed in line 6 in Figure 2). [sent-74, score-0.767]

43 $L^{(\bar\theta)}(f^{(T)}) = \frac{1}{n}\sum_{i=1}^n I\Big\{\bar\theta\sum_{t=1}^T\alpha^{(t)} - \sum_{t=1}^T\alpha^{(t)} h^{(t)}(x_i) y_i \ge 0\Big\}$ (8) $\le \frac{1}{n}\sum_{i=1}^n \exp\Big(\bar\theta\sum_{t=1}^T\alpha^{(t)} - \sum_{t=1}^T\alpha^{(t)} h^{(t)}(x_i) y_i\Big)$ (9) $= e^{\bar\theta\sum_{t=1}^T\alpha^{(t)}} \prod_{t=1}^T\Big(\sum_{j=1}^n w_j^{(t)} e^{-\alpha^{(t)} h^{(t)}(x_j) y_j}\Big) \sum_{i=1}^n w_i^{(T+1)}$. [sent-78, score-0.088]

44 The theorem follows by the definition (5) and since $\sum_{i=1}^n w_i^{(T+1)} = 1$. [sent-81, score-0.104]

45 First note that Theorem 1 explains the base objectives (1) and (2) and the base coefficient (3). [sent-82, score-0.86]

46 The goal of RegBoost is the greedy minimization of the exponential bound in (7), that is, in each iteration we attempt to minimize $E^{(t)}(\alpha, h)$. [sent-83, score-0.089]

47 Given $h^{(t)}$, $E^{(t)}(\alpha, h^{(t)})$ is minimized by (3), and with this choice for $\alpha^{(t)}$, $R_2(h) = E^{(t)}(\alpha^{(t)}, h)$, so the base learner should attempt to minimize $R_2(h)$. [sent-84, score-0.505]
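
A small numerical check of these two statements, under the assumption (suggested by the derivation above, not quoted from the paper) that the per-iteration exponential bound has the marginal-AdaBoost form E^(t)(α, h) = e^{αθ} Σ_j w_j e^{−α y_j h(x_j)}:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
w = rng.random(n); w /= w.sum()                 # normalized weights
yh = rng.choice([-1.0, 1.0], size=n)            # y_j * h(x_j) for a binary classifier
gamma = float(np.sum(w * yh))                   # edge
theta = 0.1                                     # some edge offset

E = lambda a: np.exp(a * theta) * np.sum(w * np.exp(-a * yh))
alpha = (0.5 * np.log((1 + gamma) / (1 - gamma))
         - 0.5 * np.log((1 + theta) / (1 - theta)))          # eq. (3)
r2 = (((1 + gamma) / (1 + theta)) ** ((1 + theta) / 2)
      * ((1 - gamma) / (1 - theta)) ** ((1 - theta) / 2))    # eq. (2)

assert E(alpha) <= E(alpha - 1e-3) and E(alpha) <= E(alpha + 1e-3)  # (3) minimizes E
assert abs(E(alpha) - r2) < 1e-9                                    # E at the minimum equals R2
```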

48 First, by (9) it can be seen that RegBoost directly minimizes $\frac{1}{n}\sum_{i=1}^n \exp\big(-\tilde\rho_i + \bar\theta\|\alpha\|_1\big)$, which can be interpreted as an exponential cost on the unnormalized margin with an $L_1$ weight decay. [sent-89, score-0.144]

49 The weight decay coefficient $\bar\theta$ is proportional to the average complexity of the base classifiers. [sent-90, score-0.535]

50 Second, Theorem 1 also indicates that RegBoost indirectly minimizes the marginal error $L^{(\bar\theta)}(f^{(T)})$ (6), where the margin parameter $\bar\theta$, again, moves adaptively with the average complexity of the base classifiers. [sent-91, score-0.652]

51 This explanation is supported by theoretical results that bound the generalization error in terms of the marginal error (e. [sent-92, score-0.254]

52 The third explanation is based on results that show that the difference between the marginal error and the generalization error can be upper bounded in terms of the complexity of the base classifier class H (e. [sent-95, score-0.705]

53 By imposing a non-zero penalty on the base classifiers, we can reduce the pool of admissible functions to those whose edge $\gamma$ is larger than the edge offset $\theta$. [sent-98, score-0.847]

54 Although the theoretical results do not apply directly, they support the empirical evidence (Section 4) indicating that the reduction of the pool of admissible base classifiers and the sparsity of the combined classifier play an important role in decreasing the generalization error. [sent-99, score-0.624]

55 Finally note that the algorithm can be easily extended to real-valued base classifiers along the lines of [10] and to regression by using the algorithm proposed in [11]. [sent-100, score-0.515]

56 If base classifiers come from the set $\{h : \mathbb{R}^d \to \mathbb{R}\}$, we can only use the base objective $R_1(h)$ (1), and the analytical solution (3) for the base coefficients $\alpha^{(t)}$ must be replaced by a simple numerical minimization (line search) of $E^{(t)}(\alpha, h^{(t)})$. [sent-101, score-1.328]

57 , quadratic), and the final regressor should be the weighted median of the base regressors instead of their weighted average. [sent-104, score-0.538]
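
A sketch of the numerical line search mentioned above for real-valued base classifiers, under the same assumed form of E^(t)(α, h); SciPy's bounded scalar minimizer stands in for whatever one-dimensional optimizer is actually used, and the upper bound on α is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def alpha_line_search(w, h_values, y, theta, alpha_max=50.0):
    """Numerical minimization over alpha of E(alpha, h) when the base classifier
    h is real-valued and the closed form (3) no longer applies."""
    def E(alpha):
        return np.exp(alpha * theta) * np.sum(w * np.exp(-alpha * y * h_values))
    res = minimize_scalar(E, bounds=(0.0, alpha_max), method='bounded')
    return float(res.x)
```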

58 3 The graph Laplacian regularizer The algorithm can be used with any regularized base learner that optimizes a penalized cost of the form (1). [sent-105, score-0.634]

59 In this paper we apply a smoothness functional based on the graph Laplacian. (Note that if $\theta$ is constant, as in AdaBoost or marginal AdaBoost, the minimization of $R_1(h)$ and $R_2(h)$ leads to the same solution, namely, to the base classifier that minimizes the weighted error $\epsilon$.) [sent-106, score-0.737]

60 6 As a side remark, note that applying a non-zero (even constant) penalty θ would provide an alternative solution to the singularity problem (α(t) = ∞) in the abstaining base classifier model of [10]. [sent-108, score-0.576]

61 The advantage of this penalty is that it is relatively simple to compute for enumerable base classifiers (e.g., decision stumps or decision trees) and that it suits applications where the data exhibits a low-dimensional manifold structure. [sent-110, score-0.576] [sent-112, score-0.147]

63 Formally, let $G = (V, E)$ be the neighborhood graph of the training set, where the vertex set $V = \{x_1, \ldots, x_n\}$ is identical to the set of observations, and the edge set $E$ contains pairs of "neighboring" vertices $(x_i, x_j)$ such that either $\|x_i - x_j\| < r$ or $x_i$ ($x_j$) is among the $k$ nearest neighbors of $x_j$ ($x_i$), where $r$ or $k$ is fixed. [sent-113, score-0.131] [sent-116, score-0.287]
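
A minimal sketch of this construction together with the resulting penalty, using the k-nearest-neighbor variant with 0/1 edge weights and the standard unnormalized Laplacian L = D − W; the brute-force distance computation and any constant factor in the penalty (which can be absorbed into λ) are illustrative.

```python
import numpy as np

def knn_graph_laplacian(X, k=8):
    """Symmetric k-nearest-neighbor graph W (W_ij in {0, 1}) and Laplacian L = D - W."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0
    W = np.maximum(W, W.T)      # an edge exists if either point is a neighbor of the other
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(h_values, L):
    """P_L(h) = h^T L h; for +/-1 labels this is four times the number of separated neighbors."""
    h = np.asarray(h_values, dtype=float)
    return float(h @ L @ h)
```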

65 This graph plays a crucial role in several recently developed dimensionality reduction methods since it approximates the natural topology of the data if the data is confined to a low-dimensional smooth manifold in the embedding space. [sent-117, score-0.117]

66 For binary base classifiers, $P_L(h)$ is proportional to the number of separated neighbors, that is, the number of connected pairs that are classified differently by $h$. [sent-119, score-0.452]

67 $\ldots, h(x_n))$, and $e_i$ and $\lambda_i$ are the (normalized) eigenvectors and eigenvalues of $L$, that is, $L e_i = \lambda_i e_i$, $\|e_i\| = 1$. [sent-124, score-0.189]

68 The eigenvectors with the smallest eigenvalues can be considered as the “smoothest” functions on the neighborhood graph. [sent-126, score-0.151]

69 Based on this observation, [4] proposed to learn a linear combination of a small number of the eigenvectors with the smallest eigenvalues. [sent-127, score-0.107]

70 Our approach is based on the same intuition, but instead of looking for a linear combination of the eigenvectors, we form a linear combination of known base functions and penalize them according to their smoothness on the underlying manifold. [sent-129, score-0.477]

71 So, besides semi-supervised learning (explored in Section 4), our algorithm can also be used to classify out-of-sample test observations. [sent-130, score-0.095]

72 The penalty functional can also be justified from the point of view of spectral clustering [5]. [sent-131, score-0.207]

73 The eigenvectors of $L$ with the smallest eigenvalues represent "natural" clusters in the data set, so $P_L(h)$ is small if $h$ is aligned with these eigenvectors, and $P_L(h)$ is large if $h$ splits the corresponding clusters. [sent-132, score-0.087]

74 Nevertheless, if the neighborhood graph is constructed by connecting a fixed number of nearest neighbors, $D_{ii}$ is approximately constant, so the eigenvectors of $L$ and of the normalized Laplacian $\tilde{L}$ are approximately equal. [sent-136, score-0.145]

75 The results are preliminary in the sense that we only validated the penalty coefficient λ, and did not optimize the number of neighbors (set to k = 8) and the weighting scheme of the edges of the neighborhood graph (Wij = 0 or 1). [sent-138, score-0.265]

76 We used decision stumps as base classifiers, 10-fold cross validation for estimating errors, and 5-fold cross validation for determining λ. [sent-139, score-0.504]
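
A sketch of a decision-stump base learner that minimizes the regularized objective R1(h) = weighted error + λ·P_L(h) on such a neighborhood graph; the exhaustive scan over (feature, threshold, sign) and the unnormalized Laplacian penalty are illustrative choices rather than the paper's exact base learner.

```python
import numpy as np

def regularized_stump(X, y, w, L, lam):
    """Decision stump minimizing R1(h) = sum_i w_i I{h(x_i) != y_i} + lam * h^T L h."""
    n, d = X.shape
    best_r1, best_params = np.inf, None
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                r1 = float(np.sum(w * (pred != y))) + lam * float(pred @ L @ pred)
                if r1 < best_r1:
                    best_r1, best_params = r1, (j, thr, sign)
    j, thr, sign = best_params
    return lambda X_new: np.where(X_new[:, j] <= thr, sign, -sign)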

77 Although the improvement is within the standard deviation, the difference between the test and the training error decreases significantly in two of the four experiments, which indicates reduced overfitting. [sent-141, score-0.173]

78 To measure how the penalty affects the base classifier pool, in each iteration we calculated the number of admissible base classifiers relative to the total number of stumps considered by A DA B OOST. [sent-143, score-1.182]

79 Figure 3(e) shows that, as expected, R EG B OOST traverses only a (sometimes quite small) subset of the base classifier space. [sent-144, score-0.43]

[Figure 3: training and test error curves of AdaBoost and RegBoost, as a function of the number of iterations t, on the ionosphere, breast cancer, sonar, and Pima Indians diabetes data sets, the fraction of admissible base classifiers, and the semi-supervised ionosphere experiment.]

87 Test and training errors for the (a) ionosphere, (b) breast cancer, (c) sonar, and (d) Pima Indians diabetes data sets. [sent-191, score-0.169]

88 (f) Test and training errors for the ionosphere data set with 100 labeled and 251 unlabeled data points. [sent-193, score-0.133]

89 Table 1 reports error rates and the number of base classifiers (decision stumps) after 1000 iterations for AdaBoost and RegBoost on the ionosphere, breast cancer, sonar, and Pima Indians data sets. [sent-194, score-0.346]

90 The number of stumps in the final combined classifier (AdaBoost vs. RegBoost, Table 1) was 182 vs. 114 on ionosphere, 58 vs. 30 on breast cancer, 234 vs. 199 on sonar, and 175 vs. 91 on Pima Indians. [sent-213, score-0.504]

91 Since the Laplacian penalty can be computed without knowing the labels, the algorithm can also be used for semi-supervised learning. [sent-214, score-0.187]

92 In this case, R EG B OOST can use the combined data set to calculate the penalty, whereas both algorithms can use only the labeled points to determine the base errors. [sent-216, score-0.456]
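
A sketch of how the two point sets enter the per-iteration quantities in this semi-supervised setting; the interface is illustrative, and L_all is assumed to be a Laplacian built (e.g., with the knn_graph_laplacian sketch above) on the labeled and unlabeled points together.

```python
import numpy as np

def semi_supervised_edge_and_offset(stump, X_all, L_all, X_lab, y_lab, w_lab, lam):
    """The edge offset uses predictions on ALL points (labels are not needed for
    the Laplacian penalty), while the weighted edge uses only the labeled points."""
    pred_all = stump(X_all)                                    # +/-1 predictions everywhere
    theta = 2.0 * lam * float(pred_all @ L_all @ pred_all)     # penalty from labeled + unlabeled
    gamma = float(np.sum(w_lab * stump(X_lab) * y_lab))        # edge from labeled points only
    return gamma, theta
```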

93 R EG B OOST is also far better than the semi-supervised algorithm proposed in [12] (their best test error using the same settings is 18%). [sent-218, score-0.153]

94 5 Conclusion In this paper we proposed to combine two powerful ideas, boosting and manifold learning. [sent-219, score-0.215]

95 The algorithm can be used to boost any regularized base learner. [sent-220, score-0.495]

96 Experimental results indicate that R EG B OOST slightly improves A DA B OOST by incorporating knowledge on the structure of the data into base classifier selection. [sent-221, score-0.45]

97 In the immediate future our goal is to conduct a larger scale experimental study in which we optimize all the parameters of the algorithm, and compare it not only to A DA B OOST, but also to marginal A DA B OOST, that is, R EG B OOST with a constant penalty θ. [sent-223, score-0.202]

98 Marginal AdaBoost might exhibit a similar behavior on the supervised task (sparsity, reduced number of admissible base classifiers); however, it cannot be used for semi-supervised learning. [sent-224, score-0.504]

99 Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. [sent-284, score-0.122]

100 Kégl, "Robust regression by boosting the median," in Proceedings of the 16th Conference on Computational Learning Theory, Washington, D.C. [sent-289, score-0.145]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oost', 0.616), ('base', 0.43), ('da', 0.284), ('eg', 0.215), ('regboost', 0.186), ('penalty', 0.146), ('adaboost', 0.136), ('boosting', 0.122), ('classi', 0.122), ('ers', 0.113), ('pl', 0.097), ('er', 0.088), ('laplacian', 0.08), ('error', 0.075), ('admissible', 0.074), ('stumps', 0.074), ('manifold', 0.073), ('ionosphere', 0.068), ('wi', 0.068), ('indians', 0.065), ('sonar', 0.065), ('edge', 0.057), ('coef', 0.057), ('eigenvectors', 0.056), ('marginal', 0.056), ('breast', 0.055), ('offset', 0.052), ('pima', 0.052), ('learner', 0.052), ('margin', 0.051), ('yi', 0.051), ('diabetes', 0.049), ('accommodates', 0.049), ('xi', 0.046), ('neighborhood', 0.045), ('wij', 0.045), ('regularized', 0.044), ('graph', 0.044), ('cut', 0.044), ('training', 0.042), ('functional', 0.041), ('cancer', 0.041), ('minimization', 0.038), ('ei', 0.038), ('beside', 0.037), ('test', 0.037), ('theorem', 0.036), ('xj', 0.036), ('dn', 0.036), ('gl', 0.034), ('regularization', 0.034), ('decay', 0.033), ('mason', 0.032), ('regressor', 0.032), ('returns', 0.032), ('smallest', 0.031), ('pool', 0.031), ('neighbors', 0.03), ('weight', 0.029), ('schapire', 0.029), ('line', 0.029), ('iteration', 0.028), ('weighted', 0.028), ('penalization', 0.028), ('pt', 0.026), ('dii', 0.026), ('niyogi', 0.026), ('combined', 0.026), ('smoothness', 0.025), ('sparser', 0.025), ('explanation', 0.024), ('penalties', 0.024), ('generalization', 0.024), ('minimize', 0.023), ('errors', 0.023), ('regression', 0.023), ('cuts', 0.023), ('unnormalized', 0.023), ('cost', 0.022), ('proportional', 0.022), ('penalize', 0.022), ('manifolds', 0.022), ('cantly', 0.022), ('algorithm', 0.021), ('belkin', 0.021), ('regularizer', 0.021), ('complexity', 0.021), ('wj', 0.02), ('median', 0.02), ('knowing', 0.02), ('tsch', 0.02), ('proposed', 0.02), ('indicate', 0.02), ('spectral', 0.02), ('sparsity', 0.019), ('ln', 0.019), ('indicates', 0.019), ('interpreted', 0.019), ('eigenvalues', 0.019), ('improve', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

2 0.63732326 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

3 0.16399516 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

4 0.12314686 19 nips-2004-An Application of Boosting to Graph Classification

Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto

Abstract: This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling a number of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraph as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments using natural language data and chemical compounds show that our method achieves comparable or even better performance than SVMs with convolution kernels as well as improves the testing efficiency. 1

5 0.11931521 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1

6 0.11350536 70 nips-2004-Following Curved Regularized Optimization Solution Paths

7 0.079704054 131 nips-2004-Non-Local Manifold Tangent Learning

8 0.07740932 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

9 0.076145977 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

10 0.075003453 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

11 0.074944027 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

12 0.074442267 137 nips-2004-On the Adaptive Properties of Decision Trees

13 0.07149227 178 nips-2004-Support Vector Classification with Input Data Uncertainty

14 0.06998346 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

15 0.069696486 136 nips-2004-On Semi-Supervised Classification

16 0.068989106 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

17 0.066532016 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

18 0.065300241 168 nips-2004-Semigroup Kernels on Finite Sets

19 0.065251082 115 nips-2004-Maximum Margin Clustering

20 0.060143482 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, 0.114), (2, -0.016), (3, 0.209), (4, 0.024), (5, 0.052), (6, 0.184), (7, 0.152), (8, -0.312), (9, 0.451), (10, -0.22), (11, -0.319), (12, 0.1), (13, -0.02), (14, 0.185), (15, -0.007), (16, 0.085), (17, -0.065), (18, 0.137), (19, -0.064), (20, -0.128), (21, -0.012), (22, 0.067), (23, 0.016), (24, 0.078), (25, -0.095), (26, -0.034), (27, 0.025), (28, 0.056), (29, 0.054), (30, 0.044), (31, 0.025), (32, 0.045), (33, 0.0), (34, 0.004), (35, -0.016), (36, -0.029), (37, 0.057), (38, -0.055), (39, 0.032), (40, -0.001), (41, -0.066), (42, 0.031), (43, -0.003), (44, 0.003), (45, 0.012), (46, -0.035), (47, 0.006), (48, -0.028), (49, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95560855 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

same-paper 2 0.95447713 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

3 0.51327229 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

4 0.34483597 19 nips-2004-An Application of Boosting to Graph Classification

Author: Taku Kudo, Eisaku Maeda, Yuji Matsumoto

Abstract: This paper presents an application of Boosting for classifying labeled graphs, general structures for modeling a number of real-world data, such as chemical compounds, natural language texts, and bio sequences. The proposal consists of i) decision stumps that use subgraph as features, and ii) a Boosting algorithm in which subgraph-based decision stumps are used as weak learners. We also discuss the relation between our algorithm and SVMs with convolution kernels. Two experiments using natural language data and chemical compounds show that our method achieves comparable or even better performance than SVMs with convolution kernels as well as improves the testing efficiency. 1

5 0.27881557 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

6 0.2501322 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

7 0.24119601 136 nips-2004-On Semi-Supervised Classification

8 0.21891721 82 nips-2004-Incremental Algorithms for Hierarchical Classification

9 0.21691218 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

10 0.21620765 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

11 0.2152086 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

12 0.20987616 103 nips-2004-Limits of Spectral Clustering

13 0.20947662 137 nips-2004-On the Adaptive Properties of Decision Trees

14 0.20925726 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

15 0.20722021 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

16 0.20546973 131 nips-2004-Non-Local Manifold Tangent Learning

17 0.20443396 70 nips-2004-Following Curved Regularized Optimization Solution Paths

18 0.2029818 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

19 0.20233251 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

20 0.19853835 187 nips-2004-The Entire Regularization Path for the Support Vector Machine


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.258), (13, 0.088), (15, 0.113), (26, 0.047), (31, 0.04), (33, 0.239), (39, 0.024), (50, 0.035), (76, 0.013), (87, 0.018), (94, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93110275 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

2 0.91249627 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

Abstract: The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We first build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efficiently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain. 1

same-paper 3 0.87330931 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve A DA B OOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use A DA B OOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

4 0.82783854 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 
1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. 
For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). 
Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. 
In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. 
D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. 
4 Discussion

Work on the study of natural image statistics has recently evolved from issues about scale-space hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (e.g., bowties, hierarchical representations such as complex cells) and top-down (e.g., GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing.

A most critical aspect of a hierarchical representational model is the way in which the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic data and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components.

An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells [6]; the mixer variable is more obviously related to the output of quadrature-pair neurons (such as orientation-energy or motion-energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion.

References

[1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.
[2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications.
[4] A Hyvärinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003.
[5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999.
[6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997.
[9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994.
[10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.
[11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999.
[12] J Romberg, H Choi, and R Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.
[13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.
[14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, 2000.
[15] H Brehm and W Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.
[16] T Bollerslev, R F Engle, and D B Nelson. ARCH models. In R F Engle and D L McFadden, editors, Handbook of Econometrics, volume IV, 1994.
[17] A Hyvärinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.
[18] P Hoyer and A Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003.
[20] L Wiskott and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] C Kayser, W Einhäuser, O Dümmer, P König, and K P Körding. Extracting slow subspaces from natural videos leads to complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int'l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg.
[22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996.
[23] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[24] U Grenander and A Srivastava. Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002.
[25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int'l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society.
[26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.
[27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

5 0.80910784 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of A DA B OOST to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for A DA B OOST to median boosting and to its localized variant. First, we extend recent results on efficient margin maximization to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1
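To make the aggregation rule named in this abstract concrete: the final regressor is a (weighted) median of the base regressors' outputs. The sketch below is a generic weighted-median aggregator in Python; the weighting scheme, function name, and interface are illustrative assumptions and do not reproduce the paper's margin-based algorithm.

import numpy as np

def weighted_median_predict(base_preds, weights):
    # base_preds: (n_regressors, n_points) predictions of the base regressors
    # weights: (n_regressors,) nonnegative combination weights
    base_preds = np.asarray(base_preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    out = np.empty(base_preds.shape[1])
    for k in range(base_preds.shape[1]):
        order = np.argsort(base_preds[:, k])          # sort predictions at point k
        cum = np.cumsum(w[order])                     # cumulative weight in that order
        out[k] = base_preds[order, k][np.searchsorted(cum, 0.5)]
    return out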

6 0.76424122 145 nips-2004-Parametric Embedding for Class Visualization

7 0.73579752 200 nips-2004-Using Random Forests in the Structured Language Model

8 0.73423725 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

9 0.73157763 44 nips-2004-Conditional Random Fields for Object Recognition

10 0.73075521 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

11 0.73058861 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

12 0.73046392 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

13 0.73021942 102 nips-2004-Learning first-order Markov models for control

14 0.73014528 77 nips-2004-Hierarchical Clustering of a Mixture Model

15 0.7299515 54 nips-2004-Distributed Information Regularization on Graphs

16 0.72973895 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

17 0.72871691 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

18 0.72710007 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

19 0.72657186 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

20 0.72576052 124 nips-2004-Multiple Alignment of Continuous Time Series