nips nips2004 nips2004-100 knowledge-graph by maker-knowledge-mining

100 nips-2004-Learning Preferences for Multiclass Problems

Source: pdf

Author: Fabio Aiolli, Alessandro Sperduti

Abstract: Many interesting multiclass problems can be cast in the general framework of label ranking defined on a given set of classes. The evaluation for such a ranking is generally given in terms of the number of violated order constraints between classes. In this paper, we propose the Preference Learning Model as a unifying framework to model and solve a large class of multiclass problems in a large margin perspective. In addition, an original kernel-based method is proposed and evaluated on a ranking dataset with state-of-the-art results.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Many interesting multiclass problems can be cast in the general framework of label ranking defined on a given set of classes. [sent-7, score-0.721]

2 The evaluation for such a ranking is generally given in terms of the number of violated order constraints between classes. [sent-8, score-0.338]

3 In this paper, we propose the Preference Learning Model as a unifying framework to model and solve a large class of multiclass problems in a large margin perspective. [sent-9, score-0.424]

4 In addition, an original kernel-based method is proposed and evaluated on a ranking dataset with state-of-the-art results. [sent-10, score-0.319]

5 1 Introduction The presence of multiple classes in a learning domain introduces interesting tasks beyond selecting the most appropriate class for an object, which is the well-known (single-label) multiclass problem. [sent-11, score-0.417]

6 In this paper we focus on multiclass problems where labels are given as partial order constraints over the classes. [sent-14, score-0.438]

7 Tasks naturally falling into this family include category ranking, which is the task to infer full orders over the classes, binary category ranking, which is the task to infer orders such that a given subset of classes are top-ranked, and any general (q-label) classiﬁcation problem. [sent-15, score-0.303]

8 Recently, efforts have been made to unify different ranking problems. [sent-16, score-0.272]

9 In particular, in [5, 7] two frameworks have been proposed which aim at inducing a label ranking function from examples. [sent-17, score-0.379]

10 Similarly, here we consider labels coded into sets of preference constraints, expressed as preference graphs over the set of classes. [sent-18, score-0.821]

11 The multiclass problem is then reduced to learning a good set of scoring functions able to correctly rank the classes according to the constraints which are associated to the label of the examples. [sent-19, score-0.731]

12 Each preference graph disagreeing with the obtained ranking function will count as an error. [sent-20, score-0.685]

13 The primary contribution of this work is to try to make a further step towards the uniﬁcation of different multiclass settings, and different models to solve them, by proposing the Preference Learning Model, a very general framework to model and study several kinds of multiclass problems. [sent-21, score-0.636]

14 In addition, a kernel-based method particularly suited for this setting is proposed and evaluated in a binary category ranking task with very promising results. [sent-22, score-0.45]

15 The Multiclass Setting Let Ω be a set of classes. We consider a multiclass setting where data are assumed to be sampled according to a probability distribution D over X × Y, X ⊆ R^d, and a hypothesis space of functions F = {f_Θ : X × Ω → R} with parameters Θ. [sent-23, score-0.47]

16 Moreover, a cost function c(x, y|Θ) deﬁnes the cost suffered by a given hypothesis on a pattern x ∈ X having label y ∈ Y. [sent-24, score-0.436]

17 A multiclass learning algorithm searches for a set of parameters Θ∗ that minimizes the true cost, that is, the expected value of the cost according to the true distribution of data, i. [sent-25, score-0.427]

18 2 The Preference Learning Model In this section, starting from the general multiclass setting described above, we propose a general technique to solve a large family of multiclass settings. [sent-36, score-0.686]

19 The basic idea is to ”code” labels of the original multiclass problem as sets of ranking constraints given as preference graphs. [sent-37, score-1.092]

20 Then, we introduce the Preference Learning Model (PLM) for the induction of optimal scoring functions that uses those constraints as supervision. [sent-38, score-0.149]

21 In the case of ranking-based multiclass settings, labels are given as partial orders over the classes (see [1] for a detailed taxonomy of multiclass learning problems). [sent-39, score-0.765]

22 Moreover, as observed in [5], ranking problems can be generalized by considering labels given as preference graphs over a set of classes Ω = {ω1 , . [sent-40, score-0.842]

23 , ωm }, and trying to ﬁnd a consistent ranking function fR : X → Π(Ω) where Π(Ω) is the set of permutations over Ω. [sent-43, score-0.306]

24 More formally, considering a set Ω, a preference graph or ”p-graph” over Ω is a directed graph v = (N, A) where N ⊆ Ω is the set of nodes and A is the set of arcs of the graph accessed by the function A(v). [sent-44, score-0.561]

25 An arc a ∈ A is associated with its starting node ωs = ωs (a) and its ending node ωe = ωe (a) and represents the information that the class ωs is preferred to, and should be ranked higher than, ωe . [sent-45, score-0.234]

26 Let a set of scoring functions f : X × Ω → R with parameters Θ be given, working as predictors of the relevance of the associated class to given instances. [sent-47, score-0.166]

27 A deﬁnition of a ranking function naturally follows by taking the permutation of elements in Ω corresponding to the sorting of the values of these functions, i. [sent-48, score-0.272]

28 We say that a preference arc a = (ωs , ωe ) is consistent with a ranking hypothesis fR (x|Θ), and we write a fR (x|Θ), when f (x, ωs |Θ) ≥ f (x, ωe |Θ) holds. [sent-51, score-0.906]

29 Generalizing to graphs, a p-graph g is said to be consistent with an hypothesis fR (x|Θ), and we write g fR (x|Θ), if every arc compounding it is consistent, i. [sent-52, score-0.293]
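
The consistency test for arcs and p-graphs described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `arc_consistent` and `pgraph_consistent`, the class names, and the score values are hypothetical, and the scores f(x, ω|Θ) are assumed to be available as a plain dictionary keyed by class.

```python
from typing import Dict, Iterable, Tuple

Arc = Tuple[str, str]  # (preferred class, less-preferred class)


def arc_consistent(scores: Dict[str, float], arc: Arc) -> bool:
    """True when the preferred class scores at least as high as the other."""
    start, end = arc
    return scores[start] >= scores[end]


def pgraph_consistent(scores: Dict[str, float], arcs: Iterable[Arc]) -> bool:
    """A p-graph is consistent only if every one of its arcs is consistent."""
    return all(arc_consistent(scores, arc) for arc in arcs)


# Hypothetical scores f(x, w) for three classes on a single instance x.
scores = {"w1": 2.0, "w2": 0.5, "w3": 1.0}
print(pgraph_consistent(scores, [("w1", "w2"), ("w1", "w3")]))  # True
print(pgraph_consistent(scores, [("w2", "w3")]))                # False
```

Representing a p-graph as a list of arcs keeps the check independent of how the scoring functions themselves are parameterized.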

30 The PLM Mapping Let us start by considering the way a multiclass problem is transformed into a PLM problem. [sent-55, score-0.352]

31 As seen before, to evaluate the quality of a ranking function fR (x|Θ) it is necessary to specify the nature of a cost function c(x, y|Θ). [sent-56, score-0.381]

32 Specifically, we consider cost definitions that associate penalties with incorrect decisions (e. [sent-57, score-0.15]

33 a classiﬁcation error for classiﬁcation problems or wrong ordering for ranking problems). [sent-59, score-0.296]

34 To this end, as in [5], we consider a label mapping G : y → {g1 (y), . [sent-60, score-0.201]

35 , g_{q_y}(y)} where a set of subgraphs g_i(y) ∈ G(Ω) is associated with each label y ∈ Y. [sent-63, score-0.465]

36 The total cost suffered by a ranking hypothesis fR on the example x ∈ X labeled y ∈ Y is the number of p-graphs in G(y) not consistent with the ranking, i. [sent-64, score-0.526]

37 To clarify, in Figure 1 a set of mapping examples are proposed. [sent-69, score-0.117]

38 Considering Ω = {1, 2, 3, 4, 5}, in Figure 1-(a) the label y = [1, 2|3, 4, 5] for a 2-label classiﬁcation setting is given. [sent-70, score-0.157]

39 In particular, this corresponds to the mapping G(y) = GI (y) = y where a single wrong ranking of a class makes the predictor pay a unit of cost. [sent-71, score-0.392]

40 Similarly, in Figure 1-(b) the label mapping G(y) = GD (y) is presented for the same problem. [sent-72, score-0.201]

41 Another variant is presented in Figure 1-(c) where the label mapping G(y) = Gd (y) is used and the target classes are independently evaluated and their errors cumulated. [sent-73, score-0.297]

42 Note that all these graphs are subgraphs of the original label in 1-(a). [sent-74, score-0.185]
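
To make the three label mappings concrete, here is a minimal Python sketch for the 2-label example y = [1, 2|3, 4, 5]. Figure 1 is not reproduced in this listing, so the exact shape of each mapping is an assumption: the identity mapping is taken to be one p-graph containing every (relevant, irrelevant) arc, the domination mapping one p-graph per relevant class, and the disagreement mapping one single-arc p-graph per pair. All function names and score values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Arc = Tuple[int, int]
PGraph = List[Arc]


def identity_mapping(relevant: Set[int], irrelevant: Set[int]) -> List[PGraph]:
    """Assumed G_I: a single p-graph holding every (relevant, irrelevant) arc."""
    return [[(r, i) for r in sorted(relevant) for i in sorted(irrelevant)]]


def domination_mapping(relevant: Set[int], irrelevant: Set[int]) -> List[PGraph]:
    """Assumed G_D: one p-graph per relevant class, with arcs to all irrelevant ones."""
    return [[(r, i) for i in sorted(irrelevant)] for r in sorted(relevant)]


def disagreement_mapping(relevant: Set[int], irrelevant: Set[int]) -> List[PGraph]:
    """Assumed G_d: one single-arc p-graph per (relevant, irrelevant) pair."""
    return [[(r, i)] for r in sorted(relevant) for i in sorted(irrelevant)]


def cost(scores: Dict[int, float], graphs: List[PGraph]) -> int:
    """Number of p-graphs that contain at least one violated arc."""
    return sum(any(scores[s] < scores[e] for s, e in g) for g in graphs)


relevant, irrelevant = {1, 2}, {3, 4, 5}
scores = {1: 0.9, 2: 0.2, 3: 0.5, 4: 0.3, 5: 0.1}  # hypothetical scores
print(cost(scores, identity_mapping(relevant, irrelevant)))      # 1
print(cost(scores, domination_mapping(relevant, irrelevant)))    # 1
print(cost(scores, disagreement_mapping(relevant, irrelevant)))  # 2
```

With these scores, class 2 is ranked below classes 3 and 4, so the same ranking pays one unit under the first two mappings and two units under the third, which illustrates how the choice of mapping changes the cost being counted.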

43 As an additional example we consider the three cases depicted in the right hand side of Figure 1 that refer to a ranking problem with three classes Ω = {1, 2, 3}. [sent-75, score-0.345]

44 As before, this also corresponds to the label mapping G(y) = GI (y). [sent-77, score-0.201]

45 Two alternative cost deﬁnitions can be obtained by using the p-graphs (sets of basic preferences actually) depicted in Figure 1-(e) and 1-(f). [sent-78, score-0.176]

46 Note that the cost functions in these cases are different. [sent-79, score-0.136]

48 The PLM Setting Once the label mapping G is ﬁxed, the preference constraints of the original multiclass problem can be arranged into a set of preference constraints. [sent-81, score-1.373]

49 ,qy } and each pair (x, g) ∈ X × G(Ω) is a preference constraint. [sent-84, score-0.382]

50 This can happen, for example, when multiple ranking constraints are associated to the same example of the original multiclass problem. [sent-86, score-0.715]

51 Because of this, in the following, we prefer to use a different notation for the instances in preference constraints to avoid confusion with training examples. [sent-87, score-0.468]

52 For a preference constraint (v, g) ∈ V, the constraint error incurred by the ranking hypothesis fR (v|Θ) is given by δ(v, g|Θ) = [[g is not consistent with fR (v|Θ)]]. [sent-89, score-0.835]

53 The empirical cost is then defined as the cost over the whole constraint set, i. [sent-90, score-0.301]

54 In addition, we deﬁne the margin of an hypothesis on a pattern v for a preference arc a = (ωs , ωe ), expressing how well the preference is satisﬁed, as the difference between the scores of the two linked nodes, i. [sent-93, score-1.038]

55 The margin for a p-graph constraint (v, g) is then defined as the minimum of the margins of its component preferences, $\rho_G(v, g|\Theta) = \min_{a \in A(g)} \rho_A(v, a|\Theta)$, and gives a measure of how well the hypothesis fulfills a given preference constraint. [sent-96, score-0.663]
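
The two margin definitions can be sketched as follows. The arc margin is taken to be the score of the starting node minus the score of the ending node, as the text describes; the function names and the score values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Arc = Tuple[str, str]  # (preferred class, less-preferred class)


def arc_margin(scores: Dict[str, float], arc: Arc) -> float:
    """Score of the preferred class minus score of the other class."""
    start, end = arc
    return scores[start] - scores[end]


def pgraph_margin(scores: Dict[str, float], arcs: Iterable[Arc]) -> float:
    """Minimum arc margin: the worst-satisfied preference in the p-graph."""
    return min(arc_margin(scores, arc) for arc in arcs)


scores = {"w1": 2.0, "w2": 0.5, "w3": 1.0}  # hypothetical scores
print(pgraph_margin(scores, [("w1", "w2"), ("w1", "w3")]))  # 1.0
print(pgraph_margin(scores, [("w2", "w3")]))                # -0.5
```

A negative p-graph margin means that at least one arc is violated, so the sign of the margin agrees with the consistency test.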

56 Learning in PLM In the PLM we try to learn a ”simple” hypothesis able to minimize the empirical cost of the original multiclass problem or equivalently to satisfy the constraints in V(S) as much as possible. [sent-98, score-0.622]

57 Given a set V of pairs (vi , gi ) ∈ X × G(Ω), i ∈ {1, . [sent-100, score-0.255]

58 , N }, $N = \sum_{i=1} q_{y_i}$, find a set of parameters for the ranking function fR (v|Θ) able to minimize a combination of a regularization and an empirical loss term, $\hat{\Theta} = \arg\min_{\Theta} \{R_e[\Theta, V] + \mu R(\Theta)\}$ with µ a given constant. [sent-103, score-0.387]

59 The problem of learning with multiple classes (up to constant factors) is then reduced to a minimization of a (possibly regularized) loss functional $\hat{\Theta} = \arg\min_{\Theta} \{L(V|\Theta) + \mu R(\Theta)\}$ (2), where $L(V|\Theta) = \sum_{i=1}^{N} \max_{a \in A(g_i)} L(f(v_i, \omega_s(a)|\Theta) - f(v_i, \omega_e(a)|\Theta))$. [sent-106, score-0.235]
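
As a rough illustration of the data term in the loss functional, the following Python sketch sums, over all preference constraints, the largest per-arc loss. The soft-margin loss [1 − ρ]+ is used here as one possible choice of L; the function names, the representation of a constraint as a (scores, arcs) pair, and the numbers are assumptions made for the example, and the regularizer is left out.

```python
from typing import Callable, Dict, List, Tuple

Arc = Tuple[int, int]
Constraint = Tuple[Dict[int, float], List[Arc]]  # (scores on v_i, arcs of g_i)


def hinge(rho: float) -> float:
    """Soft-margin loss [1 - rho]_+ applied to a margin rho."""
    return max(0.0, 1.0 - rho)


def plm_loss(constraints: List[Constraint],
             loss: Callable[[float], float] = hinge) -> float:
    """Sum over constraints of the maximum loss among the arcs of each p-graph."""
    total = 0.0
    for scores, arcs in constraints:
        total += max(loss(scores[s] - scores[e]) for s, e in arcs)
    return total


constraints = [
    ({1: 2.0, 2: 0.5, 3: 1.5}, [(1, 2), (1, 3)]),  # arc losses 0.0 and 0.5
    ({1: 0.0, 2: 1.0}, [(1, 2)]),                  # arc loss 2.0
]
print(plm_loss(constraints))  # 2.5
```

Taking the maximum inside each constraint means a p-graph is penalized through its worst arc, which mirrors the minimum-margin definition for p-graph constraints.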

60 Interestingly, PLM highlights that this approximation in fact corresponds to a change on the label mapping obtained by decomposing a complex preference graph into a set of binary preferences and thus changing the cost deﬁnition we are indeed minimizing. [sent-116, score-0.822]

61 Table of loss functions L(ρ): $[1 - \beta^{-1}\rho]_+$; $\log_2(1 + \exp(-\rho))$; $[1 - \rho]_+$; $[1 - \rho]_+^2$ (Least Square); $\exp(-\rho)$ (Exponential). Multiclass Prediction through PLM A multiclass prediction is a function H : X → Y mapping instances to their associated label. [sent-119, score-0.467]

62 Let be given a label mapping deﬁned as G(y) = {g1 (y), . [sent-120, score-0.201]

63 Then, the PLM multiclass prediction is given as the label whose induced preference constraints mostly agree with the current hypothesis, i. [sent-124, score-0.873]
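
The prediction rule is cut off in this listing, so the following Python sketch only illustrates one plausible reading: among a set of candidate labels, return the one whose induced p-graphs are violated least often by the current scores. The rule itself, the function names, the single-label mapping, and the scores are all assumptions made for illustration, not the paper's stated formula.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Arc = Tuple[int, int]
PGraph = List[Arc]


def violated_graphs(scores: Dict[int, float], graphs: List[PGraph]) -> int:
    """Count p-graphs that contain at least one violated arc."""
    return sum(any(scores[s] < scores[e] for s, e in g) for g in graphs)


def predict(scores: Dict[int, float],
            label_mapping: Callable[[int], List[PGraph]],
            candidates: Iterable[int]) -> int:
    """Assumed rule: pick the candidate label with the fewest violated p-graphs."""
    return min(candidates, key=lambda y: violated_graphs(scores, label_mapping(y)))


# Hypothetical single-label setting: label y prefers class y to every other class.
classes = [1, 2, 3]


def single_label_mapping(y: int) -> List[PGraph]:
    return [[(y, other) for other in classes if other != y]]


scores = {1: 0.2, 2: 0.9, 3: 0.4}
print(predict(scores, single_label_mapping, classes))  # 2
```

In this single-label case the rule reduces to choosing the top-scoring class, which is the behaviour one would expect from a standard multiclass predictor.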

64 3 Preference Learning with Kernel Machines In this section, we focus on a particular setting of the PLM framework consisting of a multivariate embedding h : X → Rs of linear functions parameterized by a set of vectors Wk ∈ Rd , k ∈ {1, . [sent-131, score-0.144]

65 This matrix has the same role as the coding matrix in multiclass coding, e. [sent-147, score-0.368]

66 Finally, the scoring function for a given class is computed as the dot product between the embedding function and the class code vector, $f(x, \omega_r | W, M) = \langle h(x), M_r \rangle = \sum_{k=1}^{s} M_{rk} \langle W_k, x \rangle$ (3). Now, we are able to describe a kernel-based method for the effective solution of the PLM problem. [sent-150, score-0.175]
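
A small Python sketch of the scoring function in Eq. (3), using plain lists rather than a numerical library. The helper names `dot`, `embed`, and `score`, and the values of W, M, and x, are hypothetical; the standard-basis codes are chosen only to keep the arithmetic easy to follow.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def embed(W: List[List[float]], x: Sequence[float]) -> List[float]:
    """h(x): one linear function <W_k, x> per embedding dimension k."""
    return [dot(w_k, x) for w_k in W]


def score(W: List[List[float]], M: List[List[float]],
          x: Sequence[float], r: int) -> float:
    """f(x, w_r | W, M) = <h(x), M_r>."""
    return dot(embed(W, x), M[r])


W = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical weights, s = 2 rows of dimension d = 2
M = [[1.0, 0.0], [0.0, 1.0]]  # one code vector per class (standard basis)
x = [3.0, 0.5]
print([score(W, M, x, r) for r in range(2)])  # [3.0, 1.0]
```

With standard-basis codes each class simply reads off one coordinate of the embedding, so the model behaves like one linear scorer per class.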

67 In particular, we present the problem formulation and the associated optimization method for the task of learning the embedding function given ﬁxed codes for the classes (embedding problem). [sent-151, score-0.282]

68 Another worthwhile task consists in the optimization of the codes for the classes when the embedding function is kept ﬁxed (coding problem), or even to perform a combination of the two (see for example [8]). [sent-152, score-0.279]

69 Speciﬁcally, consider the vector y(a) = (Mωs (a) − Mωe (a) ) ∈ Rs deﬁned for every preference arc in a given preference constraint, that is a = (ωs , ωe ) ∈ A(g). [sent-155, score-0.907]

70 From this derivation it turns out that each preference of a constraint in the set V can be viewed as an example of dimension s · d in a binary classification problem. [sent-160, score-0.467]

71 Each pair (vi , gi ) ∈ V then generates a number of examples in this extended binary problem equal to the number of arcs of the p-graph gi, for a total of $\sum_{i=1}^{N} |A(g_i)|$ examples. [sent-161, score-0.617]
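
The reduction of each preference arc to a binary example of dimension s · d can be sketched as follows. The derivation itself is truncated in this listing, so the construction shown here (the flattened outer product of y(a) with the instance v) is a reconstruction under that assumption; the helper names and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def expand(M: List[List[float]], arc: Tuple[int, int],
           v: Sequence[float]) -> List[float]:
    """Flattened outer product of y(a) = M[start] - M[end] with v (length s * d)."""
    start, end = arc
    y = [ms - me for ms, me in zip(M[start], M[end])]
    return [y_k * v_j for y_k in y for v_j in v]


def flatten(W: List[List[float]]) -> List[float]:
    """Stack the rows W_1, ..., W_s into one vector of length s * d."""
    return [w for row in W for w in row]


W = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical embedding weights (s = 2, d = 2)
M = [[1.0, 0.0], [0.0, 1.0]]  # standard-basis codes for two classes
v = [3.0, 0.5]
z = expand(M, (0, 1), v)
print(z)                   # [3.0, 0.5, -3.0, -0.5]
print(dot(flatten(W), z))  # 2.0
```

The printed inner product equals the difference between the scores of the two classes on v (3.0 minus 1.0), so requiring it to be large and positive is the same as requiring a large margin on the arc.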

72 The Kernel Preference Learning Optimization As pointed out before, the central task in PLM is to learn scoring functions in such a way to be as much as possible consistent with the set of constraints in V. [sent-164, score-0.204]

73 For the embedding problem, instantiating the problem (2), and choosing the 2-norm of the parameters as regularizer, we obtain $\hat{W} = \arg\min_W \frac{1}{N} \sum_{i=1}^{N} L_C(v_i, g_i|W, M) + \mu\|W\|^2$ where, according to Eq. [sent-166, score-0.346]

74 (1), the loss for each preference constraint is computed as the maximum of the losses of all the associated preferences, that is $L_i = \max_{a \in A(g_i)} L(\langle W, z_i^a \rangle)$. [sent-167, score-0.7]

75 When the constraint set in V contains basic preferences only (that is, p-graphs consisting of a single arc a_i = A(g_i)), the optimization problem can be simplified into the minimization of a standard functional combining a loss function with a regularization term. [sent-168, score-0.411]

76 See [11] for a set of examples of loss functions and the formulation of the associated problem with kernels. [sent-170, score-0.146]

77 Borrowing the idea of soft-margin [9], for each preference arc, a linear loss is used giving an upper bound on the indicator function loss. [sent-173, score-0.443]

78 Speciﬁcally, we use the SVM-like soft margin loss L(ρ) = [1 − ρ]+ . [sent-174, score-0.141]

79 These requirements can be expressed by the following quadratic problem: $\min_{W,\xi} \frac{1}{2}\|W\|^2 + C \sum_{i}^{N} \xi_i$ subject to $\langle W, z_i^a \rangle \ge 1 - \xi_i$, i ∈ {1, . [sent-176, score-0.142]

80 , N } (5) Note that differently from the SVM formulation for the binary classiﬁcation setting, here the slack variables ξi are associated to multiple examples, one for each preference arc in the p-graph. [sent-180, score-0.64]

81 , N } $\max_{\alpha} \sum_{i,a}$ … (8) Since $W_k = \sum_{i,a} y_k(a)\,\alpha_i^a v_i = \sum_{i,a} [M_{\omega_s(a)} - M_{\omega_e(a)}]_k\,\alpha_i^a v_i$, k = 1, . [sent-195, score-0.397]

82 , s, we obtain $h_k(x) = \langle W_k, x \rangle = \sum_{i,a} [M_{\omega_s(a)} - M_{\omega_e(a)}]_k\,\alpha_i^a \langle v_i, x \rangle$. [sent-197, score-0.166]

83 Embedding Optimization The problem in (8) recalls the one obtained for single-label multiclass SVM [1, 2] and, in fact, its optimization can be performed in a similar way. [sent-199, score-0.349]

84 Assuming a number of arcs for each preference constraint equal to q, the dual problem in (8) involves N · q variables leading to a very large scale problem. [sent-200, score-0.558]

85 However, it can be noted that the independence of constraints among the different preference constraints allows for the separation of the variables into N disjoint sets of q variables each. [sent-201, score-0.61]

86 The algorithm we propose for the optimization of the overall problem consists in iteratively selecting a preference constraint from the constraints set (a p-graph) and then optimizing with respect to the variables associated with it, that is one for each arc of the p-graph. [sent-202, score-0.758]

87 Let the preference constraint (vi , gi ) ∈ V be selected at a given iteration. To enforce the constraint $\sum_{a \in A(g_i)} \alpha_i^a + \lambda_i = C$, $\lambda_i \ge 0$, two elements from the set of variables $\{\alpha_i^a \mid a \in A(g_i)\} \cup \{\lambda_i\}$ will be optimized in pairs while keeping the solution inside the feasible region $\alpha_i^a \ge 0$. [sent-205, score-0.791]

88 We evaluated our framework on the binary category ranking task induced by the original multi-label classiﬁcation task, thus requiring rankings having target classes of the original multi-label problem on top. [sent-230, score-0.57]

89 IErr is the cost function indicating a non-perfect ranking and corresponds to the identity mapping in Figure 1-(a). [sent-233, score-0.475]

90 DErr is the cost defined as the number of relevant classes incorrectly ranked by the algorithm and corresponds to the domination mapping in Figure 1-(b). [sent-234, score-0.367]

91 dErr is the cost obtained by counting the number of incorrect rankings and corresponds to the disagreement mapping in Figure 1-(c). [sent-235, score-0.332]

92 Two other well-known Information Retrieval (IR) based cost functions have been used. [sent-236, score-0.136]

93 The OneErr cost function, which is 1 whenever the top-ranked class is not a relevant class, and the average precision cost function, which is $\mathrm{AvgP} = \frac{1}{|y|} \sum_{r \in y} \frac{|\{r' \in y : \mathrm{rank}(x, r') \le \mathrm{rank}(x, r)\}|}{\mathrm{rank}(x, r)}$. [sent-237, score-0.3]
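
The two IR-style measures can be sketched in Python as below. The average precision follows the usual definition (for each relevant class, the fraction of relevant classes among those ranked at or above it, averaged over relevant classes); ranks start at 1, ties are broken by sort order, and the function names and scores are hypothetical.

```python
from typing import Dict, Set


def ranks(scores: Dict[int, float]) -> Dict[int, int]:
    """Rank 1 goes to the highest-scoring class."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {c: pos + 1 for pos, c in enumerate(order)}


def one_error(scores: Dict[int, float], relevant: Set[int]) -> int:
    """1 if the top-ranked class is not relevant, else 0."""
    top = max(scores, key=scores.get)
    return 0 if top in relevant else 1


def average_precision(scores: Dict[int, float], relevant: Set[int]) -> float:
    """Mean over relevant classes of (relevant classes ranked at or above) / rank."""
    rank = ranks(scores)
    total = 0.0
    for r in relevant:
        hits = sum(1 for q in relevant if rank[q] <= rank[r])
        total += hits / rank[r]
    return total / len(relevant)


scores = {1: 0.9, 2: 0.2, 3: 0.5, 4: 0.3, 5: 0.1}  # hypothetical scores
relevant = {1, 2}
print(one_error(scores, relevant))          # 0
print(average_precision(scores, relevant))  # 0.75
```

Here class 1 is ranked first and class 2 fourth, so the per-class precisions are 1/1 and 2/4, giving an average of 0.75.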

94 Results The model evaluation has been performed by comparing three different label mappings for KPLM and the baseline MMP algorithm [4], a variant of the Perceptron algorithm for ranking problems, with respect to the above-mentioned ranking losses. [sent-238, score-0.729]

95 KPLM has been implemented setting s = m and the standard basis vectors er ∈ Rm as codes associated to the classes. [sent-240, score-0.14]

96 Moreover, using identity and domination mappings seems to lead to models that outperform the ones obtained by using the disagreement mapping. [sent-247, score-0.15]

97 Table 1: Comparisons of ranking performance for different methods using different loss functions according to different evaluation metrics. [sent-271, score-0.36]

98 5 Conclusions and Future Work We have presented a common framework for the analysis of general multiclass problems and proposed a kernel-based method as an instance of this setting which has shown very good results on a binary category ranking task. [sent-273, score-0.748]

99 Promising directions of research, that we are currently pursuing, include experimenting with coding optimization and considering to extend the current setting to on-line learning, interdependent labels (e. [sent-274, score-0.195]

100 On the learnability and design of output codes for multiclass problems. [sent-295, score-0.373]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('plm', 0.409), ('preference', 0.382), ('multiclass', 0.318), ('ranking', 0.272), ('gi', 0.255), ('kplm', 0.245), ('fr', 0.203), ('vi', 0.166), ('arc', 0.143), ('za', 0.142), ('gd', 0.142), ('cost', 0.109), ('label', 0.107), ('wk', 0.104), ('mapping', 0.094), ('derr', 0.082), ('hypothesis', 0.075), ('classes', 0.073), ('lc', 0.071), ('preferences', 0.067), ('embedding', 0.067), ('constraints', 0.066), ('yk', 0.065), ('loss', 0.061), ('aiolli', 0.061), ('domination', 0.061), ('mmp', 0.061), ('qy', 0.061), ('rs', 0.06), ('scoring', 0.056), ('margin', 0.056), ('codes', 0.055), ('constraint', 0.053), ('arcs', 0.052), ('category', 0.052), ('mappings', 0.05), ('setting', 0.05), ('coding', 0.05), ('rankings', 0.049), ('variables', 0.048), ('classi', 0.048), ('maxa', 0.045), ('lagrangian', 0.043), ('avgp', 0.041), ('compounding', 0.041), ('gqy', 0.041), ('ierr', 0.041), ('kesler', 0.041), ('oneerr', 0.041), ('pisa', 0.041), ('sperduti', 0.041), ('uncorrect', 0.041), ('rx', 0.041), ('disagreement', 0.039), ('cation', 0.037), ('gj', 0.036), ('suffered', 0.036), ('associated', 0.035), ('ful', 0.035), ('consistent', 0.034), ('considering', 0.034), ('minw', 0.032), ('worthwhile', 0.032), ('binary', 0.032), ('optimization', 0.031), ('graph', 0.031), ('empirical', 0.03), ('ranked', 0.03), ('minimization', 0.03), ('labels', 0.03), ('rank', 0.028), ('graphs', 0.027), ('losses', 0.027), ('ordinal', 0.027), ('subgraphs', 0.027), ('functions', 0.027), ('class', 0.026), ('orders', 0.026), ('functional', 0.026), ('italy', 0.026), ('cally', 0.026), ('obtaining', 0.026), ('subgraph', 0.025), ('arg', 0.024), ('problems', 0.024), ('soft', 0.024), ('mr', 0.024), ('original', 0.024), ('dual', 0.023), ('examples', 0.023), ('evaluated', 0.023), ('kernel', 0.023), ('primal', 0.022), ('crammer', 0.022), ('predictors', 0.022), ('denoted', 0.022), ('task', 0.021), ('rm', 0.021), ('reduced', 0.021), ('instances', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 100 nips-2004-Learning Preferences for Multiclass Problems

Author: Fabio Aiolli, Alessandro Sperduti

2 0.17490672 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

3 0.15096551 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

Author: Juan Coz, Gustavo F. Bayón, Jorge Díez, Oscar Luaces, Antonio Bahamonde, Carlos Sañudo

Abstract: In this paper we show that it is possible to model sensory impressions of consumers about beef meat. This is not a straightforward task; the reason is that when we are aiming to induce a function that maps object descriptions into ratings, we must consider that consumers’ ratings are just a way to express their preferences about the products presented in the same testing session. Therefore, we had to use a special purpose SVM polynomial kernel. The training data set used collects the ratings of panels of experts and consumers; the meat was provided by 103 bovines of 7 Spanish breeds with different carcass weights and aging periods. Additionally, to gain insight into consumer preferences, we used feature subset selection tools. The result is that aging is the most important trait for improving consumers’ appreciation of beef meat.

4 0.120115 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

Author: Amnon Shashua, Tamir Hazan

Abstract: This paper presents a general family of algebraic positive definite similarity functions over spaces of matrices with varying column rank. The columns can represent local regions in an image (whereby images have varying number of local parts), images of an image sequence, motion trajectories in a multibody motion, and so forth. The family of set kernels we derive is based on a group invariant tensor product lifting with parameters that can be naturally tuned to provide a cook-book of sorts covering the possible "wish lists" from similarity measures over sets of varying cardinality. We highlight the strengths of our approach by demonstrating the set kernels for visual recognition of pedestrians using local parts representations.

5 0.098311342 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs.

6 0.091416456 115 nips-2004-Maximum Margin Clustering

7 0.091246434 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification

8 0.082852893 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

9 0.079834543 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

10 0.073181018 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

11 0.07312347 54 nips-2004-Distributed Information Regularization on Graphs

12 0.071859956 19 nips-2004-An Application of Boosting to Graph Classification

13 0.069472849 160 nips-2004-Seeing through water

14 0.067565784 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

15 0.066086106 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

16 0.065920256 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

17 0.064853445 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

18 0.064563297 34 nips-2004-Breaking SVM Complexity with Cross-Training

19 0.062906034 82 nips-2004-Incremental Algorithms for Hierarchical Classification

20 0.062344097 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, 0.076), (2, -0.001), (3, 0.093), (4, 0.011), (5, 0.088), (6, 0.031), (7, 0.06), (8, 0.106), (9, -0.055), (10, -0.107), (11, 0.074), (12, 0.025), (13, 0.057), (14, -0.037), (15, 0.195), (16, -0.025), (17, -0.017), (18, -0.023), (19, 0.037), (20, 0.107), (21, -0.138), (22, -0.059), (23, -0.088), (24, 0.026), (25, -0.083), (26, 0.123), (27, 0.043), (28, 0.11), (29, -0.185), (30, 0.018), (31, 0.018), (32, 0.045), (33, -0.059), (34, -0.164), (35, 0.124), (36, 0.042), (37, 0.002), (38, -0.009), (39, 0.03), (40, 0.085), (41, 0.204), (42, 0.026), (43, -0.2), (44, 0.028), (45, -0.095), (46, -0.059), (47, 0.049), (48, -0.026), (49, -0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93790489 100 nips-2004-Learning Preferences for Multiclass Problems

Author: Fabio Aiolli, Alessandro Sperduti

2 0.67263669 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

Author: Juan Coz, Gustavo F. Bayón, Jorge Díez, Oscar Luaces, Antonio Bahamonde, Carlos Sañudo

3 0.49130604 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

4 0.47911018 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification

Author: Tong Zhang

Abstract: We consider the problem of deriving class-size independent generalization bounds for some regularized discriminative multi-category classification methods. In particular, we obtain an expected generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical result, we argue that the formulation over-penalizes misclassification error, which in theory may lead to poor generalization performance. A remedy, based on a generalization of multi-category logistic regression (conditional maximum entropy), is then proposed, and its theoretical properties are examined.

5 0.47553429 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

Author: Olivier Chapelle, Za\

Abstract: Choice-based conjoint analysis builds models of consumer preferences over products with answers gathered in questionnaires. Our main goal is to bring tools from the machine learning community to solve this problem more efficiently. Thus, we propose two algorithms to quickly and accurately estimate consumer preferences.

6 0.47216201 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

7 0.46926245 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

8 0.4101406 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

9 0.40704143 19 nips-2004-An Application of Boosting to Graph Classification

10 0.35801294 141 nips-2004-Optimal sub-graphical models

11 0.32064325 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

12 0.31168461 115 nips-2004-Maximum Margin Clustering

13 0.30319697 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

14 0.30182299 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

15 0.30128884 145 nips-2004-Parametric Embedding for Class Visualization

16 0.30094668 54 nips-2004-Distributed Information Regularization on Graphs

17 0.29962975 34 nips-2004-Breaking SVM Complexity with Cross-Training

18 0.2984772 165 nips-2004-Semi-supervised Learning on Directed Graphs

19 0.28689578 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

20 0.28567833 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.087), (15, 0.108), (26, 0.068), (31, 0.015), (33, 0.177), (35, 0.037), (39, 0.025), (50, 0.026), (51, 0.011), (56, 0.011), (77, 0.011), (87, 0.335)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81330144 17 nips-2004-Adaptive Manifold Learning

Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha

Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets.

2 0.80523849 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution

Author: Hyun J. Park, Te W. Lee

Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1 In trod u ction Unsupervised learning has become an important tool for understanding biological information processing and building intelligent signal processing methods. Real biological systems however are much more robust and flexible than current artificial intelligence mostly due to a much more efficient representations used in biological systems. Therefore, unsupervised learning algorithms that capture more sophisticated representations can provide a better understanding of neural information processing and also provide improved algorithm for signal processing applications. For example, independent component analysis (ICA) can learn representations similar to simple cell receptive fields in visual cortex [1] and is also applied for feature extraction, image segmentation and denoising [2,3]. ICA can approximate statistics of natural image patches by Eq.(1,2), where X is the data and u is a source signal whose distribution is a product of sparse distributions like a generalized Laplacian distribution. X = Au (1) P (u ) = ∏ P (u i ) (2) But the representation learned by the ICA algorithm is relatively low-level. In biological systems there are more high-level representations such as contours, textures and objects, which are not well represented by the linear ICA model. 
ICA learns only linear dependency between pixels by finding strongly correlated linear axes. Therefore, the modeling capability of ICA is quite limited. Previous approaches showed that one can learn more sophisticated high-level representations by capturing nonlinear dependencies in a post-processing step after the ICA step [4,5,6,7,8]. The focus of these efforts has centered on variance correlation in natural images. After ICA, a source signal is not linearly predictable from others. However, given variance dependencies, a source signal is still ‘predictable’ in a nonlinear manner. It is not possible to de-correlate this variance dependency using a linear transformation. Several researchers have proposed extensions to capture the nonlinear dependencies. Portilla et al. used a Gaussian Scale Mixture (GSM) to model variance dependency in the wavelet domain. This model can learn variance correlation in the source prior and showed improvement in image denoising [4]. But in this model, dependency is defined only between a subset of wavelet coefficients. Hyvarinen and Hoyer suggested using a special variance-related distribution to model the variance-correlated source prior. This model can learn grouping of dependent sources (Subspace ICA) or topographic arrangements of correlated sources (Topographic ICA) [5,6]. Similarly, Welling et al. suggested a product-of-experts model where each expert represents a variance-correlated group [7]. The product form of the model enables applications to image denoising. But these models do not reveal higher-order structures explicitly. Our model is motivated by Lewicki and Karklin, who proposed a 2-stage model where the 1st stage is an ICA model (Eq. (3)) and the 2nd stage is a linear generative model where another source v generates logarithmic variance for the 1st stage (Eq. (4)) [8].
$P(u \mid \lambda) = c \exp\left(-\left|u/\lambda\right|^{q}\right)$ (3)

$\log[\lambda] = Bv$ (4)

This model captures variance dependency structure explicitly, but treating variance as an additional random variable introduces another level of complexity and requires several approximations. Thus, it is difficult to obtain a simple analytic PDF of source signal u and to apply the model for image processing problems.

We propose a hierarchical model based on ICA and a mixture of Laplacian distributions. Our model can be considered a simplification of the model in [8], obtained by constraining v to be a 0/1 random vector where only one element can be 1. Our model is computationally simpler but can still capture variance dependency. Experiments show that our model can reveal higher-order structures similar to [8]. In addition, our model provides a simple parametric PDF of variance-correlated priors, which is an important advantage for adaptive signal processing. Utilizing this, we demonstrate simple applications to image segmentation and image denoising. Our model provides an improved statistical model for natural images and can be used for other applications including feature extraction, image coding, or learning even higher-order structures.

2 Modeling nonlinear dependencies

We propose a hierarchical or 2-stage model where the 1st stage is an ICA source signal model and the 2nd stage is modeled by a mixture model with different variances (figure 1). In natural images, the correlation of variance reflects different types of regularities in the real world. Such specialized regularities can be summarized as “context” information. To model the context-dependent variance correlation, we use mixture models where Laplacian distributions with different variances represent different contexts. For each image patch, a context variable Z “selects” which Laplacian distribution will represent the ICA source signal u. The Laplacian distributions have zero mean but different variances.
The advantage of the Laplacian distribution for modeling context is that we can model a sparse distribution using only one Laplacian distribution, whereas we need more than two Gaussian distributions to do the same thing. Also, conventional ICA is a special case of our model with one Laplacian. We define the mixture model and its learning algorithm in the next sections.

Figure 1: Proposed hierarchical model (1st stage is ICA generative model. 2nd stage is mixture of “context dependent” Laplacian distributions which model U. Z is a random variable that selects a Laplacian distribution that generates the given image patch)

2.1 Mixture of Laplacian Distribution

We define a PDF for a mixture of M-dimensional Laplacian distributions as Eq. (5), where N is the number of data samples and K is the number of mixtures.

$P(U \mid \Lambda, \Pi) = \prod_{n=1}^{N} P(\vec{u}_n \mid \Lambda, \Pi) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k P(\vec{u}_n \mid \vec{\lambda}_k) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \prod_{m=1}^{M} \frac{1}{2\lambda_{k,m}} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}}\right)$ (5)

$\vec{u}_n = (u_{n,1}, u_{n,2}, \ldots, u_{n,M})$: n-th data sample, $U = (\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_N)$
$\vec{\lambda}_k = (\lambda_{k,1}, \lambda_{k,2}, \ldots, \lambda_{k,M})$: variance of the k-th Laplacian distribution, $\Lambda = (\vec{\lambda}_1, \vec{\lambda}_2, \ldots, \vec{\lambda}_K)$
$\pi_k$: probability of Laplacian distribution k, $\Pi = (\pi_1, \ldots, \pi_K)$ and $\sum_k \pi_k = 1$

It is not easy to maximize Eq. (5) directly, so we use the EM (expectation maximization) algorithm for parameter estimation. Here we introduce a new hidden context variable Z that represents which Laplacian k is responsible for a given data point. Assuming we know the hidden variable Z, we can write the likelihood of data and Z as Eq. (6),

$P(U, Z \mid \Lambda, \Pi) = \prod_{n=1}^{N} P(\vec{u}_n, Z \mid \Lambda, \Pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ (\pi_k)^{z_k^n} \prod_{m=1}^{M} \left(\frac{1}{2\lambda_{k,m}}\right)^{z_k^n} \exp\left(-z_k^n \frac{|u_{n,m}|}{\lambda_{k,m}}\right) \right]$ (6)

$z_k^n$: hidden binary random variable, 1 if the n-th data sample is generated from the k-th Laplacian, 0 otherwise ($Z = (z_k^n)$ and $\sum_k z_k^n = 1$ for all n = 1…N).

2.2 EM algorithm for learning the mixture model

The EM algorithm maximizes the log likelihood of data averaged over the hidden variable Z. The log likelihood and its expectation can be computed as in Eqs. (7) and (8).

$\log P(U, Z \mid \Lambda, \Pi) = \sum_{n,k} \left[ z_k^n \log(\pi_k) + \sum_{m} z_k^n \left( \log\frac{1}{2\lambda_{k,m}} - \frac{|u_{n,m}|}{\lambda_{k,m}} \right) \right]$ (7)

$E\{\log P(U, Z \mid \Lambda, \Pi)\} = \sum_{n,k} E\{z_k^n\} \left[ \log(\pi_k) + \sum_{m} \left( \log\frac{1}{2\lambda_{k,m}} - \frac{|u_{n,m}|}{\lambda_{k,m}} \right) \right]$ (8)

The expectation in Eq. (8) can be evaluated if we are given the data U and estimated parameters Λ and Π. For Λ and Π, the EM algorithm uses the current estimates Λ' and Π'.

$E\{z_k^n\} \equiv E\{z_k^n \mid U, \Lambda', \Pi'\} = \sum_{z_k^n = 0}^{1} z_k^n P(z_k^n \mid \vec{u}_n, \Lambda', \Pi') = P(z_k^n = 1 \mid \vec{u}_n, \Lambda', \Pi') = \frac{P(\vec{u}_n \mid z_k^n = 1, \Lambda', \Pi') \, P(z_k^n = 1 \mid \Lambda', \Pi')}{P(\vec{u}_n \mid \Lambda', \Pi')} = \frac{1}{c_n} \pi_k' \prod_{m=1}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$ (9)

where the normalization constant can be computed as

$c_n = P(\vec{u}_n \mid \Lambda', \Pi') = \sum_{k=1}^{K} P(\vec{u}_n \mid z_k^n, \Lambda', \Pi') \, P(z_k^n \mid \Lambda', \Pi') = \sum_{k=1}^{K} \pi_k' \prod_{m=1}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$ (10)

The EM algorithm works by maximizing Eq. (8), given the expectation computed from Eqs. (9) and (10). Eqs. (9) and (10) can be computed using Λ' and Π' estimated in the previous iteration of the EM algorithm. This is the E-step of the EM algorithm. Then, in the M-step, we need to maximize Eq. (8) over the parameters Λ and Π. First, we can maximize Eq. (8) with respect to Λ by setting the derivative to 0.

$\frac{\partial E\{\log P(U, Z \mid \Lambda, \Pi)\}}{\partial \lambda_{k,m}} = \sum_{n} E\{z_k^n\} \left( -\frac{1}{\lambda_{k,m}} + \frac{|u_{n,m}|}{(\lambda_{k,m})^2} \right) = 0 \;\Rightarrow\; \lambda_{k,m} = \frac{\sum_n E\{z_k^n\} \cdot |u_{n,m}|}{\sum_n E\{z_k^n\}}$ (11)

Second, for maximization of Eq. (8) with respect to Π, we can rewrite Eq. (8) as below.

$E\{\log P(U, Z \mid \Lambda, \Pi)\} = C + \sum_{n,k'} E\{z_{k'}^n\} \log(\pi_{k'})$ (12)

As we see, the derivative of Eq. (12) with respect to Π cannot be 0.
Instead, we need to use the Lagrange multiplier method for maximization. A Lagrange function can be defined as Eq. (13), where ρ is a Lagrange multiplier.

$L(\Pi, \rho) = -\sum_{n,k'} E\{z_{k'}^n\} \log(\pi_{k'}) + \rho \left( \sum_{k'} \pi_{k'} - 1 \right)$ (13)

By setting the derivative of Eq. (13) to 0 with respect to ρ and Π, we can obtain the maximizing solution with respect to Π. We just show the solution in Eq. (14).

$\frac{\partial L(\Pi, \rho)}{\partial \Pi} = 0, \; \frac{\partial L(\Pi, \rho)}{\partial \rho} = 0 \;\Rightarrow\; \pi_k = \frac{\sum_n E\{z_k^n\}}{\sum_k \sum_n E\{z_k^n\}}$ (14)

Then the EM algorithm can be summarized as in figure 2. For the convergence criterion, we can use the expectation of the log likelihood, which can be calculated from Eq. (8).

1. Initialize $\pi_k = 1/K$, $\lambda_{k,m} = E\{|u_m|\} + e$ (e is small random noise)
2. Calculate the expectation by $E\{z_k^n\} \equiv E\{z_k^n \mid U, \Lambda', \Pi'\} = \frac{1}{c_n} \pi_k' \prod_{m=1}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$
3. Maximize the log likelihood given the expectation: $\lambda_{k,m} \leftarrow \left(\sum_n E\{z_k^n\} \cdot |u_{n,m}|\right) / \left(\sum_n E\{z_k^n\}\right)$, $\pi_k \leftarrow \left(\sum_n E\{z_k^n\}\right) / \left(\sum_k \sum_n E\{z_k^n\}\right)$
4. If (converged) stop, otherwise repeat from step 2.

Figure 2: Outline of EM algorithm for Learning the Mixture Model

3 Experimental Results

Here we provide examples of image data and show how the learning procedure is performed for the mixture model. We also provide visualization of learned variances that reveal the structure of variance correlation and an application to image denoising.

3.1 Learning Nonlinear Dependencies in Natural images

As shown in figure 1, the 1st stage of the proposed model is simply the linear ICA. The ICA matrices A and $W (= A^{-1})$ are learned by the FastICA algorithm [9]. We sampled $10^5$ (=N) data from 16x16 patches (256 dim.) of natural images and use them for both first and second stage learning. ICA input dimension is 256, and source dimension is set to be 160 (=M). The learned ICA basis is partially shown in figure 1. The 2nd stage mixture model is learned given the ICA source signals.
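The four steps of Figure 2 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `em_laplacian_mixture`, `laplace_sample` and `EPS` are hypothetical, the initial perturbation is multiplicative rather than additive, a fixed iteration count stands in for the convergence test on Eq. (8), and the E-step is evaluated in the log domain to avoid underflow.

```python
import math
import random

EPS = 1e-12  # floor keeping logs and divisions finite (an added safeguard)


def laplace_sample(rng, scale):
    """Draw one zero-mean Laplacian sample with scale parameter `scale`."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplacian.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def em_laplacian_mixture(data, K, n_iter=200, seed=0):
    """Estimate (pi, lam) for a K-component mixture of factorized Laplacians."""
    rng = random.Random(seed)
    N, M = len(data), len(data[0])
    abs_u = [[abs(x) for x in row] for row in data]
    mean_abs = [sum(row[m] for row in abs_u) / N for m in range(M)]

    # Step 1: pi_k = 1/K, lambda_{k,m} near E{|u_m|} with a random perturbation.
    pi = [1.0 / K] * K
    lam = [[max(mean_abs[m] * (0.5 + rng.random()), EPS) for m in range(M)]
           for _ in range(K)]

    for _ in range(n_iter):
        # Step 2 (E-step): responsibilities E{z_k^n} as in Eqs. (9) and (10).
        resp = []
        for row in abs_u:
            log_p = [
                math.log(pi[k])
                - sum(math.log(2.0 * lam[k][m]) + row[m] / lam[k][m]
                      for m in range(M))
                for k in range(K)
            ]
            top = max(log_p)
            w = [math.exp(lp - top) for lp in log_p]
            c_n = sum(w)  # normalizer, up to the common factor exp(top)
            resp.append([x / c_n for x in w])

        # Step 3 (M-step): closed-form updates of Eqs. (11) and (14).
        for k in range(K):
            n_k = max(sum(r[k] for r in resp), EPS)
            pi[k] = max(n_k / N, EPS)
            for m in range(M):
                weighted = sum(resp[n][k] * abs_u[n][m] for n in range(N))
                lam[k][m] = max(weighted / n_k, EPS)
    return pi, lam


if __name__ == "__main__":
    rng = random.Random(1)
    # Two synthetic "contexts": a low-variance and a high-variance Laplacian.
    data = ([[laplace_sample(rng, 0.2) for _ in range(3)] for _ in range(200)]
            + [[laplace_sample(rng, 3.0) for _ in range(3)] for _ in range(200)])
    pi, lam = em_laplacian_mixture(data, K=2)
    print("mixing weights:", pi)
    print("scales:", lam)
```

On synthetic data drawn from two Laplacians with clearly different scales, the recovered `lam` rows should separate into a small-scale and a large-scale component. For the dimensions used in the paper (M = 160, K up to 256) a vectorized implementation would be needed; the pure-Python loops here are only meant to mirror the equations.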
In the 2nd stage the number of mixtures is set to 16, 64, or 256 (=K). Training by the EM algorithm is fast and several hundred iterations are sufficient for convergence (0.5 hour on a 1.7GHz Pentium PC). For the visualization of learned variance, we adapted the visualization method from [8]. Each dimension of the ICA source signal corresponds to an ICA basis (columns of A) and each ICA basis is localized in both image and frequency space. Then for each Laplacian distribution, we can display its variance vector as a set of points in image and frequency space. Each point can be color coded by variance value as in figure 3.

Figure 3: Visualization of learned variances (a1 and a2 visualize the variance of Laplacian #4, and b1 and b2 show that of Laplacian #5. High variance value is mapped to red color and low variance is mapped to blue. In Laplacian #4, variances for diagonally oriented edges are high. But in Laplacian #5, variances for edges at spatially right position are high. Variance structures are related to “contexts” in the image. For example, Laplacian #4 explains image patches that have oriented textures or edges. Laplacian #5 captures patches where the left side of the patch is clean but the right side is filled with randomly oriented edges.)

A key idea of our model is that we can mix independent distributions to get a nonlinearly dependent distribution. This modeling power is illustrated in figure 4.

Figure 4: Joint distribution of nonlinearly dependent sources. ((a) is a joint histogram of 2 ICA sources, (b) is computed from the learned mixture model, and (c) is from the learned Laplacian model. In (a), the variance of u2 is smaller than u1 at the center area (arrow A), but almost equal to u1 at the outside (arrow B). So the variance of u2 is dependent on u1. This nonlinear dependency is closely approximated by the mixture model in (b), but not in (c).)
3.2 Unsupervised Image Segmentation

The idea behind our model is that the image can be modeled as a mixture of different variance-correlated “contexts”. We show how the learned model can be used to classify different contexts in an unsupervised image segmentation task. Given the learned model and data, we can compute the expectation of the hidden variable Z from Eq. (9). Then for an image patch, we can select the Laplacian distribution with the highest probability, which is the most explaining Laplacian or “context”. For segmentation, we use the model with 16 Laplacians. This enables abstract partitioning of images and we can visualize the organization of images more clearly (figure 5).

Figure 5: Unsupervised image segmentation (left is original image, middle is color labeled image, right image shows color coded Laplacians with variance structure. Each color corresponds to a Laplacian distribution, which represents surface or textural organization of underlying contexts. Laplacian #14 captures smooth surface and Laplacian #9 captures contrast between clear sky and textured ground scenes.)

3.3 Application to Image Restoration

The proposed mixture model provides a better parametric model of the ICA source distribution and hence an improved model of the image structure. An advantage is in the MAP (maximum a posteriori) estimation of a noisy image. If we assume Gaussian noise n, the image generation model can be written as Eq. (15). Then, we can compute the MAP estimate of the ICA source signal u by Eq. (16) and reconstruct the original image.

$X = Au + n$ (15)

$\hat{u} = \arg\max_u \log P(u \mid X, A) = \arg\max_u \left( \log P(X \mid u, A) + \log P(u) \right)$ (16)

Since we assumed Gaussian noise, P(X|u,A) in Eq. (16) is Gaussian. P(u) in Eq. (16) can be modeled as a Laplacian or a mixture of Laplacian distributions. The mixture distribution can be approximated by a maximum explaining Laplacian.
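To make Eq. (16) concrete, the sketch below works through one special case in which the MAP estimate has a closed form. It assumes, beyond what the text states, that A is square and orthonormal, so that y = AᵀX equals u plus independent Gaussian noise of variance `noise_var` in each coordinate. With a Laplacian prior exp(−|u|/λ)/(2λ), maximizing Eq. (16) per coordinate then reduces to the standard soft-thresholding rule with threshold `noise_var / λ`. The mixture prior is approximated by the single most explaining Laplacian, chosen here from the noisy coefficients, which is also an assumption. All function names are hypothetical.

```python
import math


def soft_threshold(y, lam, noise_var):
    """MAP estimate of u from y = u + n, n ~ N(0, noise_var), Laplacian prior.

    Minimizes (y - u)^2 / (2 * noise_var) + |u| / lam over u.
    """
    threshold = noise_var / lam
    return math.copysign(max(abs(y) - threshold, 0.0), y)


def most_explaining_component(y_vec, pi, lam):
    """Index k maximizing pi_k * prod_m Laplacian(y_m; lam[k][m])."""
    def log_score(k):
        return math.log(pi[k]) - sum(
            math.log(2.0 * lam[k][m]) + abs(y_vec[m]) / lam[k][m]
            for m in range(len(y_vec))
        )
    return max(range(len(pi)), key=log_score)


def map_denoise(y_vec, pi, lam, noise_var):
    """Shrink each coefficient using the scales of the selected Laplacian."""
    k = most_explaining_component(y_vec, pi, lam)
    return [soft_threshold(y, lam[k][m], noise_var)
            for m, y in enumerate(y_vec)]


if __name__ == "__main__":
    pi = [0.5, 0.5]
    lam = [[0.1, 0.1], [5.0, 5.0]]  # a "quiet" and an "active" context
    print(map_denoise([0.05, -0.02], pi, lam, noise_var=0.01))
    print(map_denoise([10.0, -8.0], pi, lam, noise_var=0.01))
```

The design point this illustrates is that a context with small scales shrinks coefficients aggressively, while a context with large scales leaves them almost untouched, so the selected Laplacian controls how much each patch is smoothed. In the paper's setting (256 inputs, 160 sources) A is not orthonormal, and Eq. (16) would have to be maximized numerically.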
We evaluated 3 different methods for image restoration: ICA MAP estimation with a simple Laplacian prior, the same with a Laplacian mixture prior, and the Wiener filter. Figure 6 shows an example and figure 7 summarizes the results obtained with different noise levels. As shown, MAP estimation with the mixture prior performs better than the others in terms of SNR and SSIM (Structural Similarity Measure) [10].

Figure 6: Image restoration results (signal variance 1.0, noise variance 0.81)

[Plots: SNR and SSIM index versus noise variance for ICA MAP (mixture prior), ICA MAP (Laplacian prior), Wiener filter, and the noisy image.]

Figure 7: SNR and SSIM for 3 different algorithms (signal variance = 1.0)

4 Discussion

We proposed a mixture model to learn nonlinear dependencies of ICA source signals for natural images. The proposed mixture of Laplacian distribution model is a generalization of the conventional independent source priors and can model variance dependency given natural image signals. Experiments show that the proposed model can learn the variance-correlated signals grouped as different mixtures and learn high-level structures, which are highly correlated with the underlying physical properties captured in the image. Our model provides an analytic prior of nearly independent and variance-correlated signals, which was not viable in previous models [4,5,6,7,8]. The learned variances of the mixture model show structured localization in image and frequency space, which is similar to the result in [8]. Since the model is given no information about the spatial location or frequency of the source signals, we can assume that the dependency captured by the mixture model reveals regularity in the natural images. As shown in image labeling experiments, such regularities correspond to specific surface types (textures) or boundaries between surfaces.
The learned mixture model can be used to discover hidden contexts that generated such regularity or correlated signal groups. Experiments also show that the labeling of image patches is highly correlated with the object surface types shown in the image. The segmentation results show regularity across image space and strong correlation with high-level concepts. Finally, we showed applications of the model to image restoration. We compared the performance with conventional ICA MAP estimation and the Wiener filter. Our results suggest that the proposed model outperforms these traditional methods. This is due to the estimation of the correlated variance structure, which provides an improved prior that has not been considered in other methods. In our future work, we plan to exploit the regularity of the image segmentation result to learn more high-level structures by building additional hierarchies on the current model. Furthermore, the application to image coding seems promising.

References

[1] A. J. Bell and T. J. Sejnowski, The ‘Independent Components’ of Natural Scenes are Edge Filters, Vision Research, 37(23):3327–3338, 1997.
[2] A. Hyvarinen, Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation, Neural Computation, 11(7):1739-1768, 1999.
[3] T. Lee, M. Lewicki, and T. Sejnowski, ICA Mixture Models for unsupervised Classification of non-gaussian classes and automatic context switching in blind separation, PAMI, 22(10), October 2000.
[4] J. Portilla, V. Strela, M. J. Wainwright and E. P. Simoncelli, Image Denoising using Scale Mixtures of Gaussians in the Wavelet Domain, IEEE Trans. on Image Processing, Vol. 12, No. 11, 1338-1351, 2003.
[5] A. Hyvarinen, P. O. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neurocomputing, 1999.
[6] A. Hyvarinen, P. O. Hoyer, Topographic Independent component analysis as a model of V1 Receptive Fields, Neurocomputing, Vol. 38-40, June 2001.
[7] M. Welling, G. E. Hinton, and S. Osindero, Learning Sparse Topographic Representations with Products of Student-t Distributions, NIPS, 2002.
[8] M. S. Lewicki and Y. Karklin, Learning higher-order structures in natural images, Network: Comput. Neural Syst. 14 (August 2003) 483-499.
[9] A. Hyvarinen, P. O. Hoyer, Fast ICA matlab code, http://www.cis.hut.fi/projects/compneuro/extensions.html/
[10] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, The SSIM Index for Image Quality Assessment, IEEE Transactions on Image Processing, vol. 13, no. 4, Apr. 2004.

same-paper 3 0.79230857 100 nips-2004-Learning Preferences for Multiclass Problems

Author: Fabio Aiolli, Alessandro Sperduti

4 0.74231076 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks.

5 0.6219027 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan

Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data.

Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a top-down perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM) [1–3]. GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination).
GSMs have wide implications in theories of cortical receptive field development, e.g. the comprehensive bubbles framework of Hyvärinen [4]. The mixer variables provide the top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency [5, 6], and the fact that the marginal distributions of receptive field-like filters have high kurtosis [7, 8]. In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework [9] and statistical models for divisive gain control [6]. Coordinated structure has also been addressed in other image work [10–14], and in other domains such as speech [15] and finance [16].

Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure [17–21]. The idea is to learn linear filters (e.g. modeling simple cells as in [22, 23]), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (e.g. complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, i.e. which linear filters share which mixer variables.

∗ Mixer variables are also called multipliers, but are unrelated to the scales of a wavelet.

Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data.
1 GSM inference of Gaussian and mixer variables

In a simple, n-dimensional, version of a GSM, filter responses l are synthesized† by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn}, by a common mixer variable v.

$l = vg$ (1)

We assume g are uncorrelated ($\sigma^2$ along the diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution:

$p[v] \propto \left[v \exp(-v^2/2)\right]^a$ (2)

where 0 < a ≤ 1 parameterizes the strength of the prior. For ease, we develop the theory for a = 1. As is well known [2], and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn, v from the n filter responses l1 . . . ln. It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model:

rv | distribution | posterior mean
$p[v \mid l_1]$ | $\sqrt{\frac{\sigma}{|l_1|}} \, \frac{1}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \exp\left(-\frac{v^2}{2} - \frac{l_1^2}{2v^2\sigma^2}\right)$ | $\sqrt{\frac{|l_1|}{\sigma}} \, \frac{B\left(1, \frac{|l_1|}{\sigma}\right)}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$
$p[v \mid l]$ | $\left(\frac{l}{\sigma}\right)^{\frac{n-2}{2}} \frac{1}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)} \, v^{-(n-1)} \exp\left(-\frac{v^2}{2} - \frac{l^2}{2v^2\sigma^2}\right)$ | $\sqrt{\frac{l}{\sigma}} \, \frac{B\left(\frac{3}{2} - \frac{n}{2}, \frac{l}{\sigma}\right)}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}$
$p[|g_1| \mid l_1]$ | $\sqrt{\sigma |l_1|} \, \frac{1}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \, \frac{1}{g_1^2} \exp\left(-\frac{g_1^2}{2\sigma^2} - \frac{l_1^2}{2g_1^2}\right)$ | $\sqrt{\sigma |l_1|} \, \frac{B\left(0, \frac{|l_1|}{\sigma}\right)}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$
$p[|g_1| \mid l]$ | $\left(\frac{\sigma l_1^2}{l}\right)^{\frac{2-n}{2}} \frac{1}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)} \, g_1^{\,n-3} \exp\left(-\frac{g_1^2 l^2}{2\sigma^2 l_1^2} - \frac{l_1^2}{2g_1^2}\right)$ | $\sqrt{\frac{\sigma}{l}} \, |l_1| \, \frac{B\left(\frac{n}{2} - \frac{1}{2}, \frac{l}{\sigma}\right)}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}$

where B(n, x) is the modified Bessel function of the second kind (see also [24]), $l = \sqrt{\sum_i l_i^2}$ and gi is forced to have the same sign as li, since the mixer variables are always positive.
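The two signatures of Eq. (1) mentioned above, high kurtosis of the marginal and the bowtie dependency between responses sharing a mixer, can be checked numerically. The sketch below is illustrative only: it samples a Rayleigh mixer with a = 1 (via the inverse CDF) and unit-variance Gaussians, and the helper names `sample_gsm`, `kurtosis` and `correlation` are hypothetical. The bowtie is summarized here by the correlation between magnitudes, which is a simplification of the joint conditional histograms shown in figure 1.

```python
import math
import random


def sample_gsm(n_samples, n_filters, sigma=1.0, seed=0):
    """Sample filter responses l = v * g with one shared mixer v per sample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Rayleigh mixer (a = 1), inverse-CDF sampling: v = sqrt(-2 ln U).
        v = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        samples.append([v * rng.gauss(0.0, sigma) for _ in range(n_filters)])
    return samples


def kurtosis(xs):
    """Fourth standardized moment (equal to 3 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 4 for x in xs) / (n * var ** 2)


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    responses = sample_gsm(20000, 2)
    l1 = [r[0] for r in responses]
    l2 = [r[1] for r in responses]
    print("kurtosis of l1:", kurtosis(l1))
    print("corr(l1, l2):", correlation(l1, l2))
    print("corr(|l1|, |l2|):",
          correlation([abs(x) for x in l1], [abs(x) for x in l2]))
```

The responses themselves are uncorrelated, yet their magnitudes are positively correlated because both are scaled by the same v; this is the variance dependency that the mixer variable is meant to explain.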
Note that p[v|l1] and p[g1|l1] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln}. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al. [25].

The full GSM specifies a hierarchy of mixer variables. Wainwright [2] considered a prespecified tree-based hierarchical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li, it is advantageous to learn the hierarchical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki [19] suggested generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables.

† We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images.

Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα, or mixer variable vβ. B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses (1 filter, too local; 20 filters; 40 filters, too global). C Distribution of estimate of the mixer variable vα. Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1. E Joint conditional statistics of the estimates of Gaussian variables g1 and g2.

Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, vα, vβ, or vγ, is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα, vβ, and vγ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα, vβ, and vγ. C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling.

We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα|{l}) and one of the Gaussian variables E(g1|{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not match the actual distributions well. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each estimating their respective g1 and g2. Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images.
Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model [26].

2 Neighborhood inference: solving the assignment problem

The plots in figure 1 suggest that it should be possible to infer the assignments, i.e. work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm. We assume that gi has association probability pij (satisfying $\sum_j p_{ij} = 1, \forall i$) of being assigned to mixer variable vj. The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments:

$l_i = g_i v_{s_i}$ (3)

Inference and learning in this model proceed in two stages, according to the expectation maximization algorithm. First, given a filter response li, we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al. [27] suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij.
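The generative side of Eq. (3) and a simple form of the M phase can be sketched as follows. This is an illustration under explicit assumptions, not the authors' procedure: the Gibbs sampler of the E phase is not reproduced, so the true assignments stand in for posterior samples; the update of pij is taken to be a plain frequency count, which the text does not spell out; the Rayleigh mixers use a = 1 rather than the a = .1 of figure 2; and assignments are 0-based. The names `sample_patch` and `estimate_associations` are hypothetical.

```python
import math
import random


def sample_patch(p, rng, sigma=1.0):
    """Generate one patch; returns (filter responses l, assignments s).

    p[i][j] is the association probability of filter i with mixer j.
    """
    n_filters, n_mixers = len(p), len(p[0])
    # One Rayleigh-distributed value per mixer variable for this patch.
    v = [math.sqrt(-2.0 * math.log(1.0 - rng.random()))
         for _ in range(n_mixers)]
    # Each filter independently picks the mixer it is multiplied by.
    s = [rng.choices(range(n_mixers), weights=p[i])[0]
         for i in range(n_filters)]
    # Eq. (3): l_i = g_i * v_{s_i}.
    l = [rng.gauss(0.0, sigma) * v[s[i]] for i in range(n_filters)]
    return l, s


def estimate_associations(assignments, n_mixers):
    """Frequency-count estimate of p_ij from per-patch assignments."""
    n_filters = len(assignments[0])
    counts = [[0] * n_mixers for _ in range(n_filters)]
    for s in assignments:
        for i, j in enumerate(s):
            counts[i][j] += 1
    total = len(assignments)
    return [[c / total for c in row] for row in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    p_true = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    draws = [sample_patch(p_true, rng) for _ in range(5000)]
    p_hat = estimate_associations([s for _, s in draws], n_mixers=2)
    print(p_hat)
```

In a full implementation the list of assignments passed to `estimate_associations` would come from Gibbs samples of the posterior over s given the observed responses, rather than from the generator itself.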
Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2 (A, B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (i.e., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers [2], allowing for a range of image coefficient behaviors.

3 Image data

Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid [28], with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel.
The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = 0.1 (as in figure 2). Figure 3 depicts the association weights $p_{ij}$ of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P(v|patch). The first neighborhood (in B) prefers vertical patterns across most of its "receptive field", while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions [17, 18], as do the maximal patches [19]. Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters [25], and to simulations with a larger number of neighborhoods.

[Figure 3 image not reproduced. Panels A to C plot Phase 1 and Phase 2 grids over X position and Y position (-19 to 19); panel D plots coefficient, Gaussian and mixer statistics, with Gibbs fit against assumed distributions.]

Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown are four 38×38 pixel patches that had the maximum log likelihood of P(v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables.

From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent.
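One way to examine joint conditional statistics of the kind discussed above is to measure how the spread of one variable changes with the magnitude of another. The sketch below is purely illustrative and uses synthetic data rather than the image ensemble: the helper name `conditional_std`, the unit-scale Rayleigh mixer, and the choice of eight quantile bins are all assumptions. Two responses that share a mixer should show a spread that grows with the magnitude of the other response, while the underlying Gaussian variables should show a roughly flat spread.

```python
import numpy as np

def conditional_std(x, y, n_bins=8):
    """Standard deviation of y within equal-count bins of |x| (hypothetical helper)."""
    mag = np.abs(x)
    edges = np.quantile(mag, np.linspace(0.0, 1.0, n_bins + 1))
    # Map every sample to the index of the bin containing its magnitude.
    idx = np.clip(np.searchsorted(edges, mag, side="right") - 1, 0, n_bins - 1)
    return np.array([y[idx == b].std() for b in range(n_bins)])

rng = np.random.default_rng(0)
n = 20000
g1 = rng.standard_normal(n)
g2 = rng.standard_normal(n)
v = rng.rayleigh(scale=1.0, size=n)  # shared mixer, assumed Rayleigh
l1, l2 = g1 * v, g2 * v              # two responses sharing the mixer

dependent = conditional_std(l1, l2)    # spread of l2 grows with |l1|
independent = conditional_std(g1, g2)  # spread of g2 is roughly flat
```

Comparing the first and last entries of `dependent` with those of `independent` gives a simple numerical counterpart to the conditional histograms shown in the figures.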
We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches that yielded the maximum log likelihood of P(v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B).

[Figure 4 image not reproduced. Panels A and B each show: Neighborhood, Average, Example max patches, Top 25 max patches.]

Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P(v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P(v|patch).

4 Discussion

Work on the study of natural image statistics has recently evolved from issues about scale-space hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components.
This includes bottom-up (e.g. bowties, hierarchical representations such as complex cells) and top-down (e.g. GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing.

A most critical aspect of a hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components.

An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells [6]; the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion.

References

[1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.
[2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications.
[4] A Hyvärinen, J Hurri, and J Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003.
[5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999.
[6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997.
[9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994.
[10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.
[11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999.
[12] J Romberg, H Choi, and R Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.
[13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.
[14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, 2000.
[15] H Brehm and W Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.
[16] T Bollerslev, R Engle, and D Nelson. ARCH models. In R Engle and D McFadden, editors, Handbook of Econometrics V. 1994.
[17] A Hyvärinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.
[18] P Hoyer and A Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003.
[20] L Wiskott and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] C Kayser, W Einhäuser, O Dümmer, P König, and K P Körding. Extracting slow subspaces from natural videos leads to complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int'l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg.
[22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996.
[23] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[24] U Grenander and A Srivastava. Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002.
[25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int'l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society.
[26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.
[27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.

6 0.6000424 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

7 0.59788525 145 nips-2004-Parametric Embedding for Class Visualization

8 0.58855677 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

9 0.58801657 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis

10 0.58495682 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

11 0.58349526 77 nips-2004-Hierarchical Clustering of a Mixture Model

12 0.58237201 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

13 0.58087009 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

14 0.57174593 131 nips-2004-Non-Local Manifold Tangent Learning

15 0.57158577 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

16 0.57146835 165 nips-2004-Semi-supervised Learning on Directed Graphs

17 0.57109261 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

18 0.57008475 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

19 0.56952423 102 nips-2004-Learning first-order Markov models for control

20 0.56911552 69 nips-2004-Fast Rates to Bayes for Kernel Machines