nips nips2002 nips2002-133 knowledge-graph by maker-knowledge-mining

133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes


Source: pdf

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1(x, y) + I2(x, y). [sent-4, score-0.885]

2 First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? [sent-6, score-0.193]

3 Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? [sent-7, score-0.164]

4 Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. [sent-8, score-0.747]

5 We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. [sent-9, score-0.773]

6 We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. [sent-11, score-0.534]

7 1 Introduction Figure 1a shows a simple image that evokes the percept of transparency. [sent-12, score-0.451]

8 The image is typically perceived as a superposition of two layers: either a light square with a dark semitransparent square in front of it or a dark square with a light semitransparent square in front of it. [sent-13, score-0.595]

9 Mathematically, our visual system is taking a single image I(x, y) and representing it as the sum of two images: I1(x, y) + I2(x, y) = I(x, y)   (1). When phrased this way, the decomposition is surprising. [sent-14, score-0.333]

10 There are obviously an infinite number of solutions to equation 1; how does our visual system choose one? [sent-15, score-0.136]

11 A simple image that evokes the percept of transparency. [sent-18, score-0.451]

12 A simple image that does not evoke the percept of transparency. [sent-20, score-0.444]

13 Figure 1b shows a similar image that does not evoke the percept of transparency. [sent-21, score-0.444]

14 Here again there are an infinite number of solutions to equation 1, but our visual system prefers the single-layer solution. [sent-22, score-0.39]

15 Studies of the conditions for the percept of transparency go back to the very first research on visual perception (see [1] and references within). [sent-23, score-0.714]

16 Research of this type has made great progress in understanding the types of junctions and their effects (e. [sent-24, score-0.166]

17 X junctions of a certain type trigger transparency, T junctions do not). [sent-26, score-0.368]

18 While equation 1 has an infinite number of possible solutions, if we have prior probabilities P(I1(x, y)), P(I2(x, y)) then some of these solutions will be more probable than others. [sent-29, score-0.245]

19 We use the statistics of natural images to define simple priors and finally use loopy belief propagation to find the most probable decomposition. [sent-30, score-0.676]

20 We show that while the model knows nothing about “T junctions” or “X junctions”, it can generate perceptually correct decompositions from a single image. [sent-31, score-0.367]

21 2 Statistics of natural images. A remarkably robust property of natural images that has received much attention lately is the fact that when derivative filters are applied to natural images, the filter outputs tend to be sparse [5, 7]. [sent-32, score-0.635]

22 Figure 2 illustrates this fact: the histogram of the horizontal derivative filter is peaked at zero and falls off much faster than a Gaussian. [sent-33, score-0.206]

23 Similar histograms are observed for vertical derivative filters and for the gradient magnitude: |∇I|. [sent-34, score-0.305]
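
As a rough illustration of this sparsity (an editorial sketch, not code from the paper), the following Python snippet computes histograms of the horizontal and vertical derivative filter outputs and of the gradient magnitude; for a real natural image the counts are sharply peaked at zero with tails much heavier than a Gaussian of the same variance. The random array in the usage line is only a stand-in for an actual natural image.

```python
import numpy as np

def derivative_histograms(img, bins=101):
    """Histograms of horizontal/vertical derivative filter outputs and of the
    gradient magnitude for a 2-D grayscale image (float array).
    For natural images these histograms peak at zero with heavy tails."""
    dx = img[:, 1:] - img[:, :-1]            # horizontal derivative filter [-1, 1]
    dy = img[1:, :] - img[:-1, :]            # vertical derivative filter
    grad_mag = np.sqrt(dx[:-1, :] ** 2 + dy[:, :-1] ** 2)

    vmax = max(np.abs(dx).max(), np.abs(dy).max())
    edges = np.linspace(-vmax, vmax, bins + 1)
    hx, _ = np.histogram(dx, bins=edges)
    hy, _ = np.histogram(dy, bins=edges)
    hm, _ = np.histogram(grad_mag, bins=np.linspace(0.0, vmax, bins + 1))
    return hx, hy, hm, edges

# Usage (replace the random array with a real natural image to see the effect):
img = np.random.default_rng(0).random((256, 256))
hx, hy, hm, edges = derivative_histograms(img)
print("central bin count:", hx[len(hx) // 2], "max count:", hx.max())
```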

24 There is a qualitative difference between distributions for which α > 1 (when the log probability is convex) and those for which α < 1 (when it becomes concave). [sent-41, score-0.137]

25 As figure 2d shows, the natural statistics for derivative filters [...] (figure 2 panel labels: "deriv filter", "corner operator"). [sent-42, score-0.501]

26 In [9] the sparsity of derivative filters was used to decompose an image sequence as a sum of two image sequences. [sent-67, score-0.408]

27 Note that decomposing the image in figure 1a into two layers does not change the output of derivative filters: exactly the same derivatives exist in the single layer solution as in the two layer solution. [sent-69, score-0.669]

28 Thus we cannot appeal to the marginal histogram of derivative filters to explain the percept of transparency in this image. [sent-70, score-0.863]

29 There are two ways to go beyond marginal histograms of derivative filters. [sent-71, score-0.254]

30 We can either look at joint statistics of derivative filters at different locations or orientations [6] or look at marginal statistics of more complicated feature detectors (e. [sent-72, score-0.411]

31 Figures 2e,f show the histogram of this corner operator on a typical natural image. [sent-77, score-0.415]

32 To get a more quantitative description of the statistics we used maximum likelihood to fit a distribution of the form P(x) = (1/Z) e^{-a x^α} to gradient magnitudes and corner detector histograms in a number of images. [sent-79, score-0.653]

33 We found that the histograms shown in figure 2 are typical: for both gradients and corner detectors the exponent was less than 1 and the exponent for the corner detector was smaller than that of the gradients. [sent-80, score-1.036]
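
The excerpt does not spell out the fitting procedure, so the following is only a sketch of how such a maximum-likelihood fit could be done for nonnegative data such as gradient magnitudes or corner responses. The closed-form normalizer Z(a, α) = Γ(1/α) / (α a^{1/α}) for x ≥ 0 and the choice of a Nelder-Mead optimizer are assumptions, not details from the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_exp_power(x, a0=1.0, alpha0=0.7):
    """Maximum-likelihood fit of P(x) = (1/Z) exp(-a * x**alpha) for x >= 0.
    With Z(a, alpha) = Gamma(1/alpha) / (alpha * a**(1/alpha)), the negative
    log-likelihood of N samples is N*log Z + a*sum(x**alpha)."""
    x = np.asarray(x, dtype=float)
    x = x[x > 0]                          # drop exact zeros (quantization)
    n = x.size

    def nll(params):
        log_a, log_alpha = params         # log-space keeps a, alpha positive
        a, alpha = np.exp(log_a), np.exp(log_alpha)
        log_z = gammaln(1.0 / alpha) - np.log(alpha) - np.log(a) / alpha
        return n * log_z + a * np.sum(x ** alpha)

    res = minimize(nll, x0=[np.log(a0), np.log(alpha0)], method="Nelder-Mead")
    a, alpha = np.exp(res.x)
    return a, alpha

# Per the text, natural images give alpha < 1 for gradient magnitudes, a still
# smaller alpha for the corner detector, and a larger scale a for corners.
samples = np.abs(np.random.default_rng(1).standard_cauchy(5000))  # stand-in data
print(fit_exp_power(samples))
```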

34 The scaling parameter a of the corner detector was typically larger than that of the gradient magnitude. [sent-84, score-0.456]

35 3 Simple prior predicts transparency. Motivated by the qualitative statistics observed in natural images, we now define a probability distribution over images. [sent-85, score-0.759]

36 We define the log probability of an image by means of a probability over its gradients: log P(Ix, Iy) = −log Z − Σ_{x,y} [ |∇I(x, y)|^α + η c(x, y)^β ]   (3), with α = 0. [sent-86, score-0.324]

37 The parameter η was determined by the ratio of the scaling parameters in the corner and gradient distributions. [sent-89, score-0.354]

38 Given a candidate decomposition of an image I into I1 and I2 = I − I1 we define the log probability of the decomposition as the sum of the log probabilities of the gradients of I1 and I2 . [sent-90, score-0.713]

39 Of course this is only an approximation: we are ignoring dependencies between the gradients across space and orientation. [sent-91, score-0.155]
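
A minimal sketch of how this scoring could be implemented is given below. The corner operator c(x, y) is not defined in the excerpt, so a Harris/Shi-Tomasi-style smallest eigenvalue of the local structure tensor is used as a stand-in, and the parameter values α = 0.7, β = 0.25, η = 1 are placeholders rather than the paper's actual settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def corner_response(img, window=5):
    """Stand-in corner operator: smallest eigenvalue of the local structure
    tensor (Harris/Shi-Tomasi style). The paper's exact corner detector is not
    reproduced in the excerpt, so treat this definition as an assumption."""
    dx = np.gradient(img, axis=1)
    dy = np.gradient(img, axis=0)
    sxx = uniform_filter(dx * dx, window)
    syy = uniform_filter(dy * dy, window)
    sxy = uniform_filter(dx * dy, window)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam_min = 0.5 * (tr - np.sqrt(np.maximum(tr ** 2 - 4 * det, 0.0)))
    return np.maximum(lam_min, 0.0)

def layer_logprob(img, alpha=0.7, beta=0.25, eta=1.0):
    """Unnormalized log P of one layer under equation (3):
    -sum_xy [ |grad I|^alpha + eta * c(x, y)^beta ]; the -log Z term is a
    constant shared by all decompositions and is dropped."""
    dx = np.gradient(img, axis=1)
    dy = np.gradient(img, axis=0)
    grad_mag = np.sqrt(dx ** 2 + dy ** 2)
    c = corner_response(img)
    return -np.sum(grad_mag ** alpha + eta * c ** beta)

def decomposition_logprob(I, I1, **kw):
    """Log probability of the decomposition I = I1 + I2: equation (3) applied
    independently to the gradients of each layer, as described in the text."""
    I2 = I - I1
    return layer_logprob(I1, **kw) + layer_logprob(I2, **kw)
```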

40 That is, is the most probable interpretation of figure 1a one with two layers and the most probable decomposition of figure 1b one with a single layer? [sent-93, score-0.471]

41 We defined s(x, y) as the image of a single white square in the same location as the bottom right square in figure 1a,b. [sent-96, score-0.275]

42 We considered decompositions of the form I1 = γ s(x, y), I2 = I − I1 and evaluated the log probability for values of γ between −1 and 2. [sent-97, score-0.315]

43 The most probable decomposition is the one that agrees with the percept: γ = 1, with one layer for the white square and another for the gray square. [sent-99, score-0.517]

44 The most probable decomposition again agrees with the percept: γ = 0 so that one layer is zero and the second contains the full image. [sent-101, score-0.46]
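
Continuing the sketch above (it reuses the hypothetical decomposition_logprob helper defined there), the snippet below reproduces the spirit of this experiment. The stimulus is only a guess at figure 1a: it is approximated as the literal sum of a gray square and an offset white square s(x, y); the exact figure 1b stimulus is not recoverable from the excerpt, so only the transparency case is swept.

```python
import numpy as np

# Assumes decomposition_logprob() from the equation-3 sketch above.
def square(shape, top, left, size, value):
    img = np.zeros(shape)
    img[top:top + size, left:left + size] = value
    return img

H = W = 64
s = square((H, W), 28, 28, 24, 1.0)            # single white square s(x, y)
I = square((H, W), 16, 16, 24, 0.5) + s        # guess at figure 1a: sum of two squares

gammas = np.linspace(-1.0, 2.0, 31)
scores = [decomposition_logprob(I, g * s) for g in gammas]
print("most probable gamma:", gammas[int(np.argmax(scores))])
# With exponents below 1, the sweep is expected to favor gamma near 1
# (one layer per square), mirroring the result reported in the text.
```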

45 3.1 The importance of being non-Gaussian. Equation 3 can be verbally described as preferring decompositions where the total edge and corner detector magnitudes are minimal. [sent-103, score-0.701]

46 Figure 3c shows the result with α = β = 2 for the transparency figure (figure 1a). [sent-105, score-0.354]

47 This would be the optimal interpretation if the marginal histograms of edge and corner detectors were Gaussian. [sent-106, score-0.474]

48 Thus the non-Gaussian nature of the histograms is crucial for getting the transparency percept. [sent-108, score-0.523]

49 Similar “non perceptual” decompositions are obtained with other values of α, β > 1. [sent-109, score-0.247]

50 We can get some intuition for the importance of having exponents smaller than 1 from the following observation which considers the analog of the transparency problem with scalars. [sent-110, score-0.417]

51 Observation: The MAP solution to the scalar transparency problem is obtained with a = 1, b = 0 or a = 0, b = 1 if and only if log P (x) is concave. [sent-112, score-0.422]
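
The following tiny numerical check (an illustration, not from the paper) makes this all-or-nothing behavior concrete: under the penalty |x1|^α + |x2|^α with x1 + x2 = x fixed, an exponent below 1 drives the best split to the endpoints, while a Gaussian-like exponent of 2 prefers sharing the value evenly between the two layers.

```python
import numpy as np

def best_scalar_split(x, alpha, n=1001):
    """Scalar analog of transparency: write x = a*x + (1-a)*x and score each
    split by |a*x|**alpha + |(1-a)*x|**alpha (the gradient term of equation 3,
    with no corner term). Returns the minimizing a in [0, 1]."""
    a = np.linspace(0.0, 1.0, n)
    penalty = np.abs(a * x) ** alpha + np.abs((1.0 - a) * x) ** alpha
    return a[int(np.argmin(penalty))]

print("alpha = 0.7 ->", best_scalar_split(1.0, 0.7))   # 0.0 or 1.0: all-or-nothing
print("alpha = 2.0 ->", best_scalar_split(1.0, 2.0))   # 0.5: the value gets shared
```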

52 Negative log probability (equation 3) for a sequence of decompositions of figures 1a and 1b, respectively. [sent-115, score-0.315]

53 The first layer is always a single square with contrast γ and the second layer is shown in the insets. [sent-116, score-0.393]

54 Negative log probability (equation 3) for a sequence of decompositions of figure 1a with α = β = 2. [sent-118, score-0.315]

55 4 Optimization using loopy BP. Finding the most likely decomposition requires a highly nonlinear optimization. [sent-119, score-0.228]

56 We chose to discretize the problem and use max-product loopy belief propagation to find the optimum. [sent-120, score-0.29]

57 We defined a graphical model in which every node gi corresponded to a discretization of the gradient of one layer I1 at that location: gi = (gix, giy)^T. [sent-121, score-0.875]

58 For every value of gi we defined fi, which represents the gradient of the second layer at that location: fi = (Ix, Iy)^T − gi. [sent-122, score-0.968]

59 Thus the two gradient fields {gi}, {fi} represent a valid decomposition of the input image I. [sent-123, score-0.413]

60 The joint probability is given by: P(g) = (1/Z) ∏_i Ψi(gi) ∏_{<ijkl>} Ψijkl(gi, gj, gk, gl)   (4), where <ijkl> refers to four adjacent pixels that form a 2x2 local square. [sent-124, score-0.486]

61 The local potential Ψi(gi) is based on the histograms of derivative filters: Ψi(gi) = e^{(−|g|^α − |f|^α)/T}   (5), where T is an arbitrary system “temperature”. [sent-125, score-0.234]

62 Nevertheless, motivated by the recent results on similar graphs [2, 3], we ran the max-product belief propagation algorithm on it. [sent-129, score-0.16]

63 The max-product algorithm finds a gradient field {gi} that is a local maximum of equation 4 with respect to a large neighbourhood [10]. [sent-130, score-0.142]

64 This gradient field also defines the complementary gradient field {fi}, and finally we integrate the two gradient fields to find the two layers. [sent-131, score-0.294]

65 Since equation 4 is completely symmetric in {f} and {g}, we break the symmetry by requiring that the gradient at a single location gi0 belongs to layer 1. [sent-132, score-0.351]
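
The excerpt does not say how the integration of a gradient field into a layer is performed; a common choice (an assumption here, not necessarily what the authors used) is a least-squares reconstruction, which recovers the layer up to an additive constant. A sketch:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def integrate_gradient_field(gx, gy):
    """Least-squares integration on an H x W grid: find I minimizing
    ||Dx I - gx||^2 + ||Dy I - gy||^2 with forward-difference operators
    Dx, Dy. I is determined up to a constant, so the mean is removed."""
    H, W = gx.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)

    # Forward differences in x (defined on the first W-1 columns).
    rows_x = np.repeat(np.arange(H * (W - 1)), 2)
    cols_x = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1).ravel()
    vals_x = np.tile([-1.0, 1.0], H * (W - 1))
    Dx = sp.csr_matrix((vals_x, (rows_x, cols_x)), shape=(H * (W - 1), n))

    # Forward differences in y (defined on the first H-1 rows).
    rows_y = np.repeat(np.arange((H - 1) * W), 2)
    cols_y = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1).ravel()
    vals_y = np.tile([-1.0, 1.0], (H - 1) * W)
    Dy = sp.csr_matrix((vals_y, (rows_y, cols_y)), shape=((H - 1) * W, n))

    A = sp.vstack([Dx, Dy]).tocsr()
    b = np.concatenate([gx[:, :-1].ravel(), gy[:-1, :].ravel()])
    I = lsqr(A, b)[0].reshape(H, W)
    return I - I.mean()

# Quick self-check: recover a synthetic layer (up to a constant) from its
# own forward-difference gradients; the printed error should be near zero.
layer = np.zeros((32, 32))
layer[8:24, 8:24] = 1.0
gx = np.zeros_like(layer); gx[:, :-1] = layer[:, 1:] - layer[:, :-1]
gy = np.zeros_like(layer); gy[:-1, :] = layer[1:, :] - layer[:-1, :]
print(np.abs(integrate_gradient_field(gx, gy) - (layer - layer.mean())).max())
```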

66 In order to run BP we need to somehow discretize the space of possible gradients at each pixel. [sent-133, score-0.195]

67 The algorithm effectively searches over an exponentially large number of possible decompositions and chooses decompositions that agree with the percept. [sent-135, score-0.521]

68 We sample a small number of candidate gradients at each pixel. [sent-136, score-0.155]

69 Since the local potential penalizes non-zero gradients, the most probable candidates are gi = (Ix, Iy) and gi = (0, 0). [sent-137, score-0.838]

70 We also added two more candidates at each pixel: gi = (Ix, 0) and gi = (0, Iy). [sent-138, score-0.655]
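
A sketch of this discretization and of the local potential in equation 5 follows; the pairwise 2x2 potential Ψijkl, which encodes the corner statistics, is not specified in the excerpt and is omitted, and α = 0.7 and T = 1 are placeholder values.

```python
import numpy as np

def pixel_candidates(Ix, Iy):
    """The four per-pixel candidates for the layer-1 gradient gi described in
    the text: (Ix, Iy), (0, 0), (Ix, 0), (0, Iy). The layer-2 gradient is
    always the complement fi = (Ix, Iy) - gi."""
    return np.array([[Ix, Iy], [0.0, 0.0], [Ix, 0.0], [0.0, Iy]])

def node_potential(g, Ix, Iy, alpha=0.7, T=1.0):
    """Local potential of equation 5: Psi_i(gi) = exp((-|g|^alpha - |f|^alpha)/T),
    with f = (Ix, Iy) - g. alpha and T are placeholder values here."""
    g = np.asarray(g, dtype=float)
    f = np.array([Ix, Iy]) - g
    return np.exp((-np.linalg.norm(g) ** alpha - np.linalg.norm(f) ** alpha) / T)

# For a pixel with observed gradient (0.8, -0.2), the "all in layer 1" and
# "all in layer 2" candidates receive the highest local potential, matching
# the remark above that (Ix, Iy) and (0, 0) are the most probable candidates.
for g in pixel_candidates(0.8, -0.2):
    print(g, round(float(node_potential(g, 0.8, -0.2)), 3))
```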

71 With this discretization there are still an exponential number of possible decompositions of the image. [sent-139, score-0.293]

72 Figure 4 shows the output of the algorithm on the two images in figure 1. [sent-141, score-0.219]

73 An animation that illustrates the dynamics of BP on these images is available at www. [sent-142, score-0.221]

74 Note that the algorithm is essentially searching over exponentially many decompositions of the input images and knows nothing about “X junctions” or “T junctions” or squares. [sent-147, score-0.503]

75 Yet it finds the decompositions that are consistent with the human percept. [sent-148, score-0.247]

76 Will our simple prior also allow us to decompose a sum of two real images? [sent-149, score-0.33]

77 We found that for real images that have very little texture (e. [sent-151, score-0.354]

78 However, nearly any other image that we tried had some texture and on such images the model failed (e. [sent-154, score-0.492]

79 When there is texture in both layers, the model always prefers a one layer decomposition: the input image plus a zero image. [sent-157, score-0.498]

80 To understand this failure, recall that the model prefers decompositions that have few corners and few edges. [sent-158, score-0.432]

81 According to the simple “edge” and “corner” operators that we have used, real images have edges and corners at nearly every pixel, so the two-layer decomposition has twice as many edges and corners as the one-layer decomposition. [sent-159, score-1.083]

82 To decompose general real images we need to use more sophisticated features to define our prior. [sent-160, score-0.27]

83 Even for images with little texture, standard belief propagation with synchronous [...] Figure 5 (panels a-d): When we sum two arbitrary images (e. [sent-161, score-0.696]

84 This is because of the texture that results in gradients and corners at every pixel. [sent-165, score-0.378]

85 For real images that are relatively texture free (e. [sent-166, score-0.354]

86 ) the model does prefer splitting into two layers (c. [sent-169, score-0.115]

87 First, we manually divided the input image into smaller patches and ran BP separately on each patch. [sent-173, score-0.12]

88 Second, to minimize discretization artifacts we used a different number of gradient candidates at each pixel and always included the gradients of the original images in the list of candidates at that pixel. [sent-174, score-0.726]

89 Third, to avoid giving too much weight to corners and edges in textured regions, we increased the temperature at pixels where the gradient magnitude was not a local maximum. [sent-175, score-0.265]

90 In preliminary experiments we have found that similar results can be obtained with far less tweaking when we use generalized belief propagation to do the optimization. [sent-177, score-0.2]

91 5 Discussion. The percept of transparency is a paradigmatic example of the ill-posedness of vision: the number of equations is half the number of unknowns. [sent-178, score-0.638]

92 Nevertheless, our visual systems reliably and effectively compute a decomposition of a single image into two images. [sent-179, score-0.307]

93 In this paper we have argued that this perceptual decomposition may correspond to the most probable decomposition using a simple prior over images derived from natural scene statistics. [sent-180, score-0.676]

94 However, our experiments with real images show that this simple prior is not powerful enough. [sent-182, score-0.264]

95 One way to do this is by defining features that look for “texture edges” and “texture corners” and measuring their statistics in real images. [sent-184, score-0.135]

96 A theory for multiresolution signal decomposition: the wavelet representation. [sent-220, score-0.207]

97 A parametric texture model based on joint statistics of complex wavelet coefficients. [sent-234, score-0.253]

98 Bayesian denoising of visual images in the wavelet domain. [sent-246, score-0.311]

99 On the optimality of solutions of the maxproduct belief propagation algorithm in arbitrary graphs. [sent-260, score-0.203]

100 Minimax entropy principle and its application to texture modeling. [sent-263, score-0.124]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transparency', 0.354), ('percept', 0.284), ('gi', 0.261), ('corner', 0.256), ('decompositions', 0.247), ('images', 0.193), ('layer', 0.168), ('junctions', 0.166), ('gradients', 0.155), ('iy', 0.152), ('gk', 0.142), ('decomposition', 0.138), ('texture', 0.124), ('ix', 0.124), ('probable', 0.124), ('gl', 0.121), ('image', 0.12), ('gj', 0.114), ('gure', 0.113), ('ijkl', 0.109), ('histograms', 0.105), ('derivative', 0.102), ('detector', 0.102), ('candidates', 0.101), ('corners', 0.099), ('gradient', 0.098), ('lters', 0.092), ('loopy', 0.09), ('fi', 0.09), ('belief', 0.09), ('prefers', 0.086), ('layers', 0.085), ('bp', 0.08), ('histogram', 0.076), ('propagation', 0.07), ('wavelet', 0.069), ('qualitative', 0.069), ('log', 0.068), ('detectors', 0.066), ('non', 0.064), ('prob', 0.06), ('statistics', 0.06), ('square', 0.057), ('perceptually', 0.057), ('fourway', 0.054), ('semitransparent', 0.054), ('det', 0.052), ('visual', 0.049), ('natural', 0.049), ('exponent', 0.048), ('integrability', 0.047), ('evokes', 0.047), ('marginal', 0.047), ('operators', 0.046), ('discretization', 0.046), ('equation', 0.044), ('ectively', 0.043), ('lightness', 0.043), ('lter', 0.043), ('solutions', 0.043), ('location', 0.041), ('discretize', 0.04), ('tweaking', 0.04), ('evoke', 0.04), ('decompose', 0.04), ('scenes', 0.038), ('edges', 0.038), ('look', 0.038), ('exponents', 0.038), ('real', 0.037), ('trigger', 0.036), ('fk', 0.036), ('knows', 0.035), ('fj', 0.035), ('fl', 0.035), ('operator', 0.034), ('prior', 0.034), ('weiss', 0.033), ('perceived', 0.033), ('pixel', 0.032), ('magnitudes', 0.032), ('elds', 0.031), ('jerusalem', 0.031), ('prefer', 0.03), ('temperature', 0.03), ('agrees', 0.03), ('nothing', 0.028), ('tried', 0.028), ('illustrates', 0.028), ('agree', 0.027), ('perception', 0.027), ('nearly', 0.027), ('potential', 0.027), ('front', 0.027), ('eld', 0.027), ('sum', 0.026), ('dark', 0.026), ('output', 0.026), ('di', 0.025), ('intuition', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

2 0.16479039 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

Author: David R. Martin, Charless C. Fowlkes, Jitendra Malik

Abstract: The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, a classifier is trained using human labeled images as ground truth. We present precision-recall curves showing that the resulting detector outperforms existing approaches.

3 0.1470934 10 nips-2002-A Model for Learning Variance Components of Natural Images

Author: Yan Karklin, Michael S. Lewicki

Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.

4 0.12571985 94 nips-2002-Fractional Belief Propagation

Author: Wim Wiegerinck, Tom Heskes

Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.

5 0.12007726 74 nips-2002-Dynamic Structure Super-Resolution

Author: Amos J. Storkey

Abstract: The problem of super-resolution involves generating feasible higher resolution images, which are pleasing to the eye and realistic, from a given low resolution image. This might be attempted by using simple filters for smoothing out the high resolution blocks or through applications where substantial prior information is used to imply the textures and shapes which will occur in the images. In this paper we describe an approach which lies between the two extremes. It is a generic unsupervised method which is usable in all domains, but goes beyond simple smoothing methods in what it achieves. We use a dynamic tree-like architecture to model the high resolution data. Approximate conditioning on the low resolution image is achieved through a mean field approach. 1

6 0.11523493 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

7 0.11486144 96 nips-2002-Generalized² Linear² Models

8 0.11467677 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

9 0.10903202 39 nips-2002-Bayesian Image Super-Resolution

10 0.10332717 173 nips-2002-Recovering Intrinsic Images from a Single Image

11 0.084414773 126 nips-2002-Learning Sparse Multiscale Image Representations

12 0.082453817 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

13 0.079267956 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

14 0.071821705 202 nips-2002-Unsupervised Color Constancy

15 0.070867762 2 nips-2002-A Bilinear Model for Sparse Coding

16 0.069041729 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip

17 0.066442013 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

18 0.066099361 49 nips-2002-Charting a Manifold

19 0.065512978 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search

20 0.065208383 206 nips-2002-Visual Development Aids the Acquisition of Motion Velocity Sensitivities


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.185), (1, 0.007), (2, -0.071), (3, 0.243), (4, 0.008), (5, 0.036), (6, 0.057), (7, 0.129), (8, -0.123), (9, -0.042), (10, -0.054), (11, 0.006), (12, -0.005), (13, 0.014), (14, 0.085), (15, 0.024), (16, 0.044), (17, -0.01), (18, 0.007), (19, 0.007), (20, -0.082), (21, -0.027), (22, 0.022), (23, 0.013), (24, 0.094), (25, 0.056), (26, 0.022), (27, -0.015), (28, 0.082), (29, -0.094), (30, 0.047), (31, -0.01), (32, -0.078), (33, -0.105), (34, 0.042), (35, 0.045), (36, 0.01), (37, -0.007), (38, -0.048), (39, 0.116), (40, 0.071), (41, -0.002), (42, 0.041), (43, 0.017), (44, -0.084), (45, -0.045), (46, 0.083), (47, -0.15), (48, -0.087), (49, 0.115)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95640391 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

2 0.60420495 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

Author: David R. Martin, Charless C. Fowlkes, Jitendra Malik

Abstract: The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, a classifier is trained using human labeled images as ground truth. We present precision-recall curves showing that the resulting detector outperforms existing approaches.

3 0.55464053 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

4 0.51798868 173 nips-2002-Recovering Intrinsic Images from a Single Image

Author: Marshall F. Tappen, William T. Freeman, Edward H. Adelson

Abstract: We present an algorithm that uses multiple cues to recover shading and reflectance intrinsic images from a single image. Using both color information and a classifier trained to recognize gray-scale patterns, each image derivative is classified as being caused by shading or a change in the surface’s reflectance. Generalized Belief Propagation is then used to propagate information from areas where the correct classification is clear to areas where it is ambiguous. We also show results on real images.

5 0.51411611 126 nips-2002-Learning Sparse Multiscale Image Representations

Author: Phil Sallee, Bruno A. Olshausen

Abstract: We describe a method for learning sparse multiscale image representations using a sparse prior distribution over the basis function coefficients. The prior consists of a mixture of a Gaussian and a Dirac delta function, and thus encourages coefficients to have exact zero values. Coefficients for an image are computed by sampling from the resulting posterior distribution with a Gibbs sampler. The learned basis is similar to the Steerable Pyramid basis, and yields slightly higher SNR for the same number of active coefficients. Denoising using the learned image model is demonstrated for some standard test images, with results that compare favorably with other denoising methods. 1

6 0.49600998 74 nips-2002-Dynamic Structure Super-Resolution

7 0.49342808 39 nips-2002-Bayesian Image Super-Resolution

8 0.46461439 10 nips-2002-A Model for Learning Variance Components of Natural Images

9 0.45313814 96 nips-2002-Generalized² Linear² Models

10 0.41984099 2 nips-2002-A Bilinear Model for Sparse Coding

11 0.40374124 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

12 0.39614409 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

13 0.39612541 94 nips-2002-Fractional Belief Propagation

14 0.3892794 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

15 0.37566593 169 nips-2002-Real-Time Particle Filters

16 0.36812371 195 nips-2002-The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities

17 0.36561018 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

18 0.35947257 111 nips-2002-Independent Components Analysis through Product Density Estimation

19 0.3593691 202 nips-2002-Unsupervised Color Constancy

20 0.35820705 32 nips-2002-Approximate Inference and Protein-Folding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.386), (11, 0.023), (23, 0.027), (42, 0.059), (54, 0.107), (55, 0.024), (57, 0.029), (64, 0.018), (67, 0.015), (68, 0.021), (74, 0.079), (75, 0.012), (92, 0.04), (98, 0.088)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83095562 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

2 0.82742703 154 nips-2002-Neuromorphic Bisable VLSI Synapses with Spike-Timing-Dependent Plasticity

Author: Giacomo Indiveri

Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.

same-paper 3 0.79874533 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

4 0.57784826 120 nips-2002-Kernel Design Using Boosting

Author: Koby Crammer, Joseph Keshet, Yoram Singer

Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix.   The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). 
Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. 
[8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X →   Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification algorithm. The pseudo code of the booster is given in Fig. 1. The pseudo-code is an adaptation the to problem of kernel design of the sequentialupdate algorithm from [8]. As with other boosting algorithm, the base-learner, which in our case is charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 
3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution D t over a pairs ˜ of instances from the training set. It is also provided with a set of unlabelled examples S. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar products over vectors in n . We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = wwT between a vector w ∈ n and itself.     Using this definition we get, Kw (xi , xj ) = (xi ·w)(xj ·w) . Input: A distribution Dt . Labelled and unlabelled sets: ˜ ˜ Therefore, the similarity beS = {(xi , yi )}m ; S = {˜i }m . x i=1 i=1 tween two instances xi and Compute : xj is high iff both xi and xj • Calculate: ˜ are similar (w.r.t the standard A ∈ m×m , Ai,r = xi · xr ˜ inner-product) to a third vecm×m B∈ , Bi,j = Dt (i, j)yi yj tor w. Analogously, if both ˜ ˜ K ∈ m×m , Kr,s = xr · xs ˜ ˜ xi and xj seem to be dissim• Find the generalized eigenvector v ∈ m for ilar to the vector w then they the problem AT BAv = λKv which attains are similar to each other. Dethe largest eigenvalue λ spite the restrictive form of • Set: w = ( r vr xr )/ ˜ ˜ r vr xr . the inner-products, this famt ily is still too rich for our setReturn: Kernel operator Kw = ww . ting and we further impose two restrictions on the inner Figure 2: The base kernel learning algorithm. products. First, we assume ˜ that w is restricted to a linear combination of vectors from S. Second, since scaling of the base kernels is performed by the boosted, we constrain the norm of w to be 1. The m ˜ resulting class of kernels is therefore, C = {Kw = wwT | w = r=1 βr xr , w = 1} . ˜ In the boosting process we need to choose a specific base-kernel K w from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt . In this work we use the simplest version suggested by Collins et al. This version can been viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t to the current distribution Dt to be,         Score(Kw ) = Dt (i, j)yi yj Kw (xi , xj ) . (1) i,j The higher the value of the score is, the better Kw fits the training data. Note that if Dt (i, j) = 1/m2 (as is D0 ) then Score(Kw ) is proportional to the alignment since w = 1. Under mild assumptions the score can also provide a lower bound of the loss function. To see that let c be the derivative of the loss function at margin zero, c = Loss (0) . If all the √ training examples xi ∈ S lies in a ball of radius c, we get that Loss(Kw (xi , xj ), yi yj ) ≥ 1 − cKw (xi , xj )yi yj ≥ 0, and therefore, i,j Dt (i, j)Loss(Kw (xi , xj ), yi yj ) ≥ 1 − c Dt (i, j)Kw (xi , xj )yi yj . i,j Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . 
˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions.     Kernelizing the Kernel: As described above, we assumed that the standard scalarproduct constitutes the template for the class of base-kernels C. However, since the proce˜ dure for choosing a base kernel depends on S and S only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → , where κ(xi , xj ) = φ(xi ) · φ(xj ). Using a general kernel function κ we can not compute however the vector w explicitly. We therefore need to show that the norm of w, and evaluation Kw on any two examples can still be performed efficiently.   First note that given the vector v we can compute the norm of w as follows, T w 2 = vr xr ˜ vs xr ˜ r s = vr vs κ(˜r , xs ) . x ˜ r,s Next, given two vectors xi and xj the value of their inner-product is, Kw (xi , xj ) = vr vs κ(xi , xr )κ(xj , xs ) . ˜ ˜ r,s Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C. 4 Experiments Synthetic data: We generated binary-labelled data using as input space the vectors in 100 . The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N (µ, ∆ ∆−1 ) with the following parameters,   µ=y 0.03 0.03 1 ∆= √ 2 1 −1 1 1 = 0.1 0 0 0.01 . That is, the label of each examples determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 8 0.2 6 50 50 100 100 150 150 200 200 4 2 0 0 −2 −4 −6 250 250 −0.2 −8 −0.2 0 0.2 −8 −6 −4 −2 0 2 4 6 8 300 20 40 60 80 100 120 140 160 180 200 300 20 40 60 80 100 120 140 160 180 Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). 
For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right). altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. On each experiment we first learned a linear classier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added a Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not expoed to the test set noise, the learned kernel reflects better the structure of the feature space which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel. The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pairs instances, one from training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. 
In contrast, the inner-products using the learned kernel yields in a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. Similarly, for instances of opposite 200 1 12 Regular Kernel Learned Kernel 0.8 17 0.7 16 0.5 0.4 0.3 Test Error % 8 0.6 Regular Kernel Learned Kernel 18 10 Test Error % Averaged Cumulative Error % 19 Regular Kernel Learned Kernel 0.9 6 4 15 14 13 12 0.2 11 2 0.1 10 0 0 10 1 10 2 10 Round 3 10 4 10 0 2 4 6 Epochs 8 10 9 2 4 6 Epochs 8 10 Figure 4: The online training error (left), test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on the average before converging to a hyperplane which perfectly separates the training data while using the Perceptron algorithm with learned kernel required a single example to reach a perfect separation on all 100 random training sets. USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7, 291 training examples and 2, 007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scopre of this short paper. We thus constraint on the binary error results for the 45 binary problem described above. For the original kernel we chose a RBF kernel with σ = 1 which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] and with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity which can underscore differences in the kernels. We ran the kernel learning al˜ gorithm with LogLoss and ExpLoss, using bith the training set and the test test as S. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the trainin set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits slower rate of convergence with the test data, we choose a a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF a kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built ˜ using boosting with the LogLoss and S was the training data. 
In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first ˜ was build using the training instances as the templates constituing S while the second used the test instances. Although the differenece between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances. 6 4.5 4 5 Learned Kernel (Test) Learned Kernel (Train) 3.5 4 3 2 3 2.5 2 1.5 1 1 0.5 0 0 1 2 3 Base Kernel 4 5 6 0 0 1 2 3 Learned Kernel (Train) 4 5 Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel with training instances. Right: a similar scatter plot for a learned kernel only constructed from training instances (x-axis) and test instances. 5 Discussion In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different than the the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challanging question that stem from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note alternative approaches to the kernel design problem has been devised in parallel and independently. See [13, 14] for further details. Acknowledgements: Special thanks to Cyril Goutte and to John Show-Taylor for pointing the connection to the generalized eigen vector problem. Thanks also to the anonymous reviewers for constructive comments. References [1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. [2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. [3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. [4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for svm protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. [5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandla. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001. [6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002. [7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000. [8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 47(2/3):253–285, 2002. [9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. [10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [11] F. Rosenblatt. 
The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. [12] B. Sch¨ lkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. M¨ ller, G. R¨ tsch, and A.J. Smola. Input o u a space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999. [13] O. Bosquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002. [14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkenels. NIPS, 2002.

5 0.43639305 140 nips-2002-Margin Analysis of the LVQ Algorithm

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototypes based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmical aspects of such algorithms. On the theory side, we present margin based generalization bounds that suggest that these kinds of classifiers can be more accurate then the 1-NN rule. Furthermore, we derived a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20 years old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

6 0.434728 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

7 0.43071294 74 nips-2002-Dynamic Structure Super-Resolution

8 0.42622545 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip

9 0.42361781 45 nips-2002-Boosted Dyadic Kernel Discriminants

10 0.42298615 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

11 0.41716194 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

12 0.41653419 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

13 0.41442764 3 nips-2002-A Convergent Form of Approximate Policy Iteration

14 0.41407669 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

15 0.413389 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

16 0.41313326 135 nips-2002-Learning with Multiple Labels

17 0.41246301 124 nips-2002-Learning Graphical Models with Mercer Kernels

18 0.41162041 105 nips-2002-How to Combine Color and Shape Information for 3D Object Recognition: Kernels do the Trick

19 0.41145325 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics

20 0.41139102 10 nips-2002-A Model for Learning Variance Components of Natural Images