nips nips2007 nips2007-45 knowledge-graph by maker-knowledge-mining

45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)


Source: pdf

Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum

Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a simple new criterion for classification, based on principles from lossy data compression. [sent-3, score-0.235]

2 The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. [sent-4, score-0.33]

3 We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. [sent-5, score-0.123]

4 Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. [sent-6, score-0.144]

5 Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. [sent-7, score-0.408]

6 This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. [sent-9, score-0.207]

7 When the conditional class distributions pX|Y (x|y) and the class priors pY (y) are given, the maximum a posteriori (MAP) assignment is ŷ(x) = arg min_{y∈{1,…,K}} [ −log₂ pX|Y (x|y) − log₂ pY (y) ]. (1) [sent-19, score-0.083]

8 This amounts to a minimum coding length principle: the optimal classifier minimizes the Shannon optimal (lossless) coding length of the test data x with respect to the distribution of the true class. [sent-23, score-0.526]

9 The first term is the number of bits needed to code x w.r.t. [sent-24, score-0.15]

10 the distribution of class y, and the second term is the number of bits needed to code the label y for x. [sent-27, score-0.196]
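To make the two coding-length terms in (1) concrete, here is a minimal Python/NumPy sketch of the MAP rule read as a minimum lossless coding length rule. The Gaussian class models, priors, and test point are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the MAP rule (1) read as minimum coding length:
# assign x to the class minimizing -log2 p(x|y) - log2 p(y).
# The Gaussian class models and priors below are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

def map_coding_lengths(x, means, covs, priors):
    """Return -log2 p(x|y) - log2 p(y) for each class y."""
    bits = []
    for mu, Sigma, pi in zip(means, covs, priors):
        bits_x = -multivariate_normal.logpdf(x, mean=mu, cov=Sigma) / np.log(2)
        bits_y = -np.log2(pi)
        bits.append(bits_x + bits_y)
    return np.array(bits)

# Two 2-D Gaussian classes with equal priors.
means = [np.zeros(2), np.array([3.0, 0.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
y_hat = int(np.argmin(map_coding_lengths(np.array([1.2, -0.3]), means, covs, [0.5, 0.5])))
```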

11 Conventional approaches to model estimation (implicitly) assume that the distributions are nondegenerate and the samples are sufficiently dense. [sent-31, score-0.095]

12 For instance, the set of images of a human face taken from different angles and under different lighting conditions often lie in a low-dimensional subspace or submanifold of the ambient space [2]. [sent-33, score-0.128]

13 As a result, the associated distributions are degenerate or nearly degenerate. [sent-34, score-0.073]

14 For instance, when detecting a face in an image, features associated with the face often have a low-dimensional structure which is “embedded” as a submanifold in a cloud of essentially random features from the background. [sent-42, score-0.116]

15 Model selection criteria such as minimum description length (MDL) [12, 16] serve as important modifications to MAP for model estimation across classes of different complexity. [sent-43, score-0.131]

16 It selects the model that minimizes the overall coding length of the given (training) data, hence the name “minimum description length” [1]. [sent-44, score-0.268]

17 Given the difficulty of learning the (potentially degenerate) distributions pX|Y (x|y) from a few samples in a high-dimensional space, it makes more sense to seek good “surrogates” for implementing the minimum coding length principle (1). [sent-47, score-0.362]

18 Our idea is to measure how efficiently a new observation can be encoded by each class of the training data subject to an allowable distortion, and to assign the new observation to the class that requires the minimum number of additional bits. [sent-48, score-0.175]

19 We dub this criterion “minimum incremental coding length” (MICL) for classification. [sent-49, score-0.309]

20 It provides a counterpart of the MDL principle for model estimation and serves as a surrogate for the minimum coding length principle for classification. [sent-50, score-0.324]

21 The proposed MICL criterion naturally addresses the issues of regularization and model complexity. [sent-51, score-0.118]

22 Regularization is introduced through the use of lossy coding, i.e., [sent-52, score-0.144]

23 coding the test data x up to an allowable distortion (placing our approach along the lines of lossy MDL [15]). [sent-54, score-0.384]

24 This contrasts with Shannon’s optimal lossless coding length, which requires precise knowledge of the true distributions. [sent-55, score-0.196]

25 Lossy coding length also accounts for model complexity by directly measuring the difference in the volume (hence dimension) of the training data with and without the new observation. [sent-56, score-0.268]

26 Those methods choose a decision boundary that minimizes the total number of bits needed to code the boundary and the samples it incorrectly classifies. [sent-59, score-0.266]

27 In contrast, MICL uses coding length directly as a measure of how well the training data represent the new sample. [sent-60, score-0.268]

28 Within the lossy data coding framework, we establish that the MICL criterion leads to a family of classifiers that generalize the conventional MAP classifier (1). [sent-62, score-0.427]

29 We prove that for Gaussian distributions, the MICL criterion asymptotically converges to a regularized version of MAP2 (see Theorem 1) and give a precise estimate of the convergence rate (see Theorem 2). [sent-63, score-0.162]

30 Thus, lossy coding induces a regularization effect similar to Regularized Discriminant Analysis (RDA) [6], with similar gains in finite sample performance with respect to MAP/QDA. [sent-64, score-0.345]

31 We apply lossy coding in the supervised (classification) setting. [sent-66, score-0.318]

32 However, that method is sensitive to the choice of prior when the number of samples is less than the dimension of the space, a situation that poses no difficulty to our proposed classifier. [sent-69, score-0.068]

33 When the distributions involved are not Gaussian, the MICL criterion can still be applied locally, similar to the popular k-Nearest Neighbor (k-NN) classifier. [sent-70, score-0.116]

34 However, the local MICL classifier significantly improves on the k-NN classifier, as it accounts for both the number of samples and the distribution of the samples within the neighborhood. [sent-71, score-0.101]

35 The kernelized version of MICL provides a simple alternative to the SVM approach of constructing a linear decision boundary in the embedded (kernel) space, and better exploits the covariance structure of the embedded data. [sent-73, score-0.112]

36-37 A lossy coding scheme [5] maps vectors X = (x1, …, xm) ∈ R^(n×m) to a sequence of binary bits, from which the original vectors can be recovered up to an allowable distortion E[‖x − x̂‖²] ≤ ε². [sent-76, score-0.318] [sent-79, score-0.137]

38 Coding each class of training data Xj = {xi : yi = j} separately using Lε(Xj) bits, the entire training dataset can be represented by a two-part code using Σ_{j=1}^K [ Lε(Xj) − |Xj| log₂ pY(j) ] bits. [sent-82, score-0.124]

39 Here, the second term is the minimum number of bits needed to (losslessly) code the class labels yi . [sent-83, score-0.231]

40 If we code x jointly with the training data Xj of the jth class, the number of additional bits needed to code the pair (x, y) is δLε (x, j) = Lε (Xj ∪{x})−Lε (Xj )+L(j). [sent-85, score-0.228]

41 Here, the first two terms measure the excess bits needed to code (x, Xj) up to distortion ε², while the last term L(j) is the cost of losslessly coding the label y(x) = j. [sent-86, score-0.485]

42 One may view these as “finite-sample lossy” surrogates for the Shannon coding lengths in the ideal classifier (1). [sent-87, score-0.196]

43 Assign x to the class which minimizes the number of additional bits needed to code (x, ŷ), subject to the distortion ε: ŷ(x) = arg min_{j=1,…,K} δLε(x, j). (2) [sent-89, score-0.28]

44 The above criterion (2) can be taken as a general principle for classification, in the sense that it can be applied using any lossy coding scheme. [sent-94, score-0.434]

45 Nevertheless, effective classification demands that the chosen coding scheme be approximately optimal for the given data. [sent-95, score-0.174]

46 We will first consider a coding length function Lε introduced and rigorously justified in [13], which is (asymptotically) optimal for Gaussians. [sent-98, score-0.237]

47 The (implicit) use of a coding scheme which is optimal for Gaussian sources is equivalent to assuming that the conditional class distributions pX|Y can be well-approximated by Gaussians. [sent-99, score-0.228]

48 For a multivariate Gaussian source N(µ, Σ), the average number of bits needed to code a vector [sent-101, score-0.15]

49 subject to a distortion ε² is approximately Rε(Σ) = (1/2) log₂ det(I + (n/ε²) Σ) (bits/vector). [sent-102, score-0.124]

50 The data X = (x1, …, xm) with sample mean µ̂ = (1/m) Σᵢ xᵢ and covariance Σ̂(X) = (1/m) Σᵢ (xᵢ − µ̂)(xᵢ − µ̂)ᵀ can be represented up to expected distortion ε² using ≈ m Rε(Σ̂) bits. [sent-106, score-0.132]

51 The total number of bits required to code X is therefore Lε(X) = ((m+n)/2) log₂ det(I + (n/ε²) Σ̂(X)) + (n/2) log₂(1 + µ̂ᵀµ̂/ε²). (3) [sent-109, score-0.124]

52 The first term gives the number of bits needed to represent the distribution of the xi about their mean, and the second gives the cost of representing the mean. [sent-113, score-0.117]

53 The above function well-approximates the optimal coding length for Gaussian data, and has also been shown to give a good upper bound on the number of bits needed to code finitely many samples lying on a linear subspace. [sent-114, score-0.447]
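As a sketch, the reconstructed coding length function (3) can be evaluated directly with NumPy; the 1/m covariance normalization and the (m+n)/2 factor follow the formula above and should be checked against [13] before relying on the exact constants.

```python
# Sketch of the lossy coding length function (3): approximate number of bits
# needed to code the columns of X (m samples in R^n) up to distortion eps^2.
# Constants ((m+n)/2 factor, 1/m covariance) follow the reconstruction of (3)
# above; check against [13] if exact values matter.
import numpy as np

def lossy_coding_length(X, eps):
    """X: n x m array, one sample per column; returns L_eps(X) in bits."""
    n, m = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    Sigma = (Xc @ Xc.T) / m
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / eps**2) * Sigma)
    bits_data = 0.5 * (m + n) * logdet / np.log(2)                   # samples about the mean
    bits_mean = 0.5 * n * np.log2(1.0 + float(mu.T @ mu) / eps**2)   # the mean itself
    return bits_data + bits_mean
```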

54 If the test class labels Y are known to have the marginal distribution P [Y = j] = πj , then the optimal coding lengths are (within one bit): L(j) = − log2 πj . [sent-119, score-0.203]

55 Combining this coding length for the class label with the coding length function (3) for the observations, we summarize the MICL criterion (2) as Algorithm 1 below: Algorithm 1 (MICL Classifier). [sent-122, score-0.611]

56 1: Input: m training samples partitioned into K classes X1, X2, …, XK, and a test sample x. [sent-123, score-0.084]

57 3: Compute the incremental coding length of x for each class: δLε(x, j) = Lε(Xj ∪ {x}) − Lε(Xj) − log₂ π̂j. [sent-128, score-0.281]
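A sketch of the full Algorithm 1, built on the lossy_coding_length sketch above; the empirical class frequencies stand in for the priors π̂j, and the toy data and distortion at the bottom are assumptions for illustration only.

```python
# Sketch of Algorithm 1 (MICL): assign x to the class whose training data
# absorbs it with the fewest additional bits. Uses lossy_coding_length from
# the previous sketch; toy data and eps below are illustrative assumptions.
import numpy as np

def micl_classify(x, class_data, eps):
    """class_data: list of n x m_j arrays (one per class); x: length-n vector."""
    m_total = sum(Xj.shape[1] for Xj in class_data)
    deltas = []
    for Xj in class_data:
        pi_j = Xj.shape[1] / m_total                    # empirical prior estimate
        Xj_aug = np.hstack([Xj, x.reshape(-1, 1)])
        delta = (lossy_coding_length(Xj_aug, eps)
                 - lossy_coding_length(Xj, eps)
                 - np.log2(pi_j))                       # cost of coding the label
        deltas.append(delta)
    return int(np.argmin(deltas))

# Toy usage: two Gaussian classes in R^5.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(5, 100))
X1 = rng.normal(loc=2.0, size=(5, 100))
print(micl_classify(rng.normal(loc=2.0, size=5), [X0, X1], eps=0.5))
```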

58 In both cases, the MICL criterion harnesses the covariance structure of the data to achieve good classification in sparsely sampled regions. [sent-136, score-0.162]

59 In the left example, the criterion interpolates the data structure to achieve correct classification, even near the origin where the samples are sparse. [sent-137, score-0.129]

60 In the right example, the criterion extrapolates the horizontal line to the other side of the plane. [sent-138, score-0.091]

61 2 Asymptotic Behavior and Relationship to MAP In this section, we analyze the asymptotic behavior of Algorithm 1 as the number of training samples goes to infinity. [sent-143, score-0.101]

62 The following result, whose proof is given in [21], indicates that MICL converges to a regularized version of ML/MAP, subject to a reward on the dimension of the classes: Theorem 1 (Asymptotic MICL [21]). [sent-144, score-0.085]

63 Let the training samples {(xi, yi)}_{i=1}^m ∼ iid pX,Y (x, y), with conditional class distributions pX|Y (·|j) = N(µj, Σj) and priors P[Y = j] = πj. [sent-145, score-0.084]

64 Then as m → ∞, the MICL criterion coincides (asymptotically, with probability one) with the decision rule ŷ(x) = argmax_{j=1,…,K} [ LG(x | µj, Σj + (ε²/n)I) + log πj + (1/2) Dε(Σj) ], (4) [sent-148, score-0.116]

65 where LG(· | µ, Σ) is the log-likelihood function for a N(µ, Σ) distribution, and Dε(Σj) = tr(Σj(Σj + (ε²/n)I)⁻¹) is the effective dimension of the j-th model, relative to the distortion ε². [sent-152, score-0.101]

66 (Figure: classification risk of MICL and MAP plotted against the number of training samples and the ambient dimension.) [sent-165, score-0.306]

67 This result shows that asymptotically, MICL generates a family of MAP-like classifiers parametrized by the distortion ε2 . [sent-179, score-0.071]

68 Given a finite number, m, of samples, any reasonable rule for choosing the distortion ε2 should therefore ensure that ε → 0 as m → ∞. [sent-184, score-0.071]

69 As the number of samples m → ∞, the MICL criterion (2) converges to its asymptotic form (4) at a rate of m^(−1/2). [sent-190, score-0.137]

70 3 Improvements over MAP In the above, we have established the fact that asymptotically, the MICL criterion (4) is just as good as the MAP criterion. [sent-193, score-0.091]

71 Nevertheless, the MICL criterion makes several important modifications to MAP, which significantly improve its performance on sparsely sampled or degenerate data. [sent-194, score-0.173]

72 Notice that the first two terms of the asymptotic MICL criterion (4) have the form of a MAP criterion, based on an N(µ, Σ + (ε²/n)I) distribution. [sent-196, score-0.083]

73 In each example, we vary the number of training samples, m, and the distortion ε. [sent-202, score-0.102]

74 For each (m, ε) combination, we draw m training samples from two Gaussian distributions N (µi , Σi ), i = 1, 2, and estimate the Bayes risk of the resulting MICL and MAP classifiers. [sent-203, score-0.119]

75 The effective dimension term Dε(Σj) in the large-n MICL criterion (4) can be rewritten as Dε(Σj) = Σ_{i=1}^n λi/(ε²/n + λi), where λi is the ith eigenvalue of Σj. [sent-213, score-0.121]

76 In general, Dε can be viewed as a “softened” estimate of the dimension, relative to the distortion ε². [sent-221, score-0.071]
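A few lines suffice to evaluate this softened dimension from the eigenvalues, following the form reconstructed above (the ε²/n regularizer is taken from that reconstruction):

```python
# Effective dimension D_eps(Sigma) = sum_i lambda_i / (eps^2/n + lambda_i):
# eigenvalues well above eps^2/n each count roughly as one dimension,
# eigenvalues far below it count roughly as zero.
import numpy as np

def effective_dimension(Sigma, eps):
    n = Sigma.shape[0]
    lam = np.linalg.eigvalsh(Sigma)
    return float(np.sum(lam / (eps**2 / n + lam)))
```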

77 The regularization parameter in RDA and the distortion ε for MICL are chosen independently for each trial by cross validation. [sent-228, score-0.098]

78 Nevertheless, in this subsection, we discuss two practical modifications to the MICL criterion that are applicable to arbitrary distributions and preserve the desirable properties discussed in the previous subsections. [sent-242, score-0.116]

79 Since XXᵀ and XᵀX have the same non-zero eigenvalues, log₂ det(I + α XXᵀ) = log₂ det(I + α XᵀX). [sent-244, score-0.076]

80 In practice, popular choices include the polynomial kernel k(x1, x2) = (x1ᵀx2 + 1)^d, the radial basis function (RBF) kernel k(x1, x2) = exp(−γ‖x1 − x2‖²), and their variants. [sent-249, score-0.076]
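The identity above is what makes a kernelized coding length computable from an m × m Gram matrix alone; the snippet below checks it numerically and shows how an RBF Gram matrix would stand in for XᵀX. The exact scaling used by kernel MICL is not reproduced here.

```python
# Check of the determinant identity behind kernel MICL, plus an RBF Gram matrix
# that would replace X^T X for data lifted by k(x1, x2) = exp(-gamma ||x1 - x2||^2).
# Scalings specific to kernel MICL are not reproduced; a > 0 is a generic scale.
import numpy as np

def log2det(A):
    _, logdet = np.linalg.slogdet(A)
    return logdet / np.log(2)

rng = np.random.default_rng(0)
n, m, a = 10, 5, 0.3
X = rng.normal(size=(n, m))
assert np.isclose(log2det(np.eye(n) + a * X @ X.T),   # n x n determinant
                  log2det(np.eye(m) + a * X.T @ X))   # equals the m x m one

gamma = 0.5
sq = np.sum(X**2, axis=0)[:, None] + np.sum(X**2, axis=0)[None, :] - 2.0 * X.T @ X
K = np.exp(-gamma * sq)   # m x m Gram matrix of the lifted data
```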

81 Notice, however, that whereas SVM constructs a linear decision boundary in the lifted space H, kernel MICL exploits the covariance structure of the lifted data, generating decision boundaries that are (asymptotically) quadratic. [sent-252, score-0.18]

82 In Section 3 we will see that even for real data whose statistical nature is unclear, kernel MICL outperforms SVM when applied with the same kernel function. [sent-253, score-0.097]

83 In the MICL classifier (Algorithm 1), we replace the incremental coding length δLε(x, j) by its local version: δLε^k(x, j) = Lε(N_j^k(x) ∪ {x}) − Lε(N_j^k(x)) + L(j), (6) with L(j) = − log₂(|N_j^k(x)|/|N^k(x)|). [sent-263, score-0.306]

84 Then if k = o(m) and k, m → ∞, the local MICL criterion converges to the MAP criterion (1). [sent-266, score-0.221]

85 This follows, since as the radius of the neighborhood shrinks, the cost of coding the class label, − log₂(|N_j^k(x)|/|N^k(x)|) → − log₂ pj(x), dominates the coding length (6). [sent-267, score-0.398]

86 In this asymptotic setting the local MICL criterion behaves like k-Nearest Neighbor (k-NN). [sent-268, score-0.148]

87 However, the finite-sample behavior of the local MICL criterion can differ drastically from that of k-NN. (The quantity Dε has been dubbed the effective number of parameters in the context of ridge regression [9].) [sent-269, score-0.116]

88 In this case, from (4), local MICL effectively approximates the local shape of the distribution pj (x) by a (regularized) Gaussian, exploiting structure in the distribution of the nearest neighbors (see figure 3). [sent-287, score-0.071]
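A sketch of the local rule (6), reusing the lossy_coding_length sketch from earlier; the neighborhood size k and distortion ε are assumptions to be tuned (e.g., by cross validation), and empty per-class neighborhoods are handled crudely by assigning infinite cost.

```python
# Sketch of local MICL (6): code x against each class's share of the k nearest
# neighbors N^k(x). Reuses lossy_coding_length from the earlier sketch; k and
# eps are illustrative assumptions.
import numpy as np

def local_micl_classify(x, X_train, y_train, k, eps):
    """X_train: n x m array; y_train: length-m integer labels in {0, ..., K-1}."""
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - x.reshape(-1, 1), axis=0)
    nbrs = np.argsort(dists)[:k]                       # N^k(x)
    deltas = []
    for j in range(int(y_train.max()) + 1):
        idx = nbrs[y_train[nbrs] == j]                 # N_j^k(x)
        if idx.size == 0:
            deltas.append(np.inf)
            continue
        Nj = X_train[:, idx]
        label_bits = -np.log2(idx.size / k)            # L(j) = -log2(|N_j^k|/|N^k|)
        Nj_aug = np.hstack([Nj, x.reshape(-1, 1)])
        deltas.append(lossy_coding_length(Nj_aug, eps)
                      - lossy_coding_length(Nj, eps) + label_bits)
    return int(np.argmin(deltas))
```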

89 We first test the MICL classifier on two standard datasets for handwritten digit recognition (Table 1 top). [sent-290, score-0.07]

90 The MNIST handwritten digit dataset [10] consists of 60,000 training images and 10,000 test images. [sent-291, score-0.08]

91 Other preprocessing steps, synthetic training images, or more advanced skew-correction and normalization techniques have been applied to lower the error rate for SVM. [sent-314, score-0.076]

92 We further verify MICL’s effectiveness on sparsely sampled high-dimensional data using the Yale Face Database B [7], which tests illumination sensitivity of face recognition algorithms. [sent-320, score-0.126]

93 MICL outperforms classical face recognition methods such as Eigenfaces on Yale Face Database B [7]. [sent-328, score-0.089]

94 We suggest that the source of this improved performance is precisely the regularization induced by lossy coding. [sent-330, score-0.171]

95 MICL generates a family of classifiers that inherit many of the good properties of MAP, RDA, and k-NN, while extending their working conditions to sparsely sampled or degenerate high-dimensional observations. [sent-337, score-0.082]

96 MICL and its kernel and local versions approach best reported performance on high-dimensional visual recognition problems without domain-specific engineering. [sent-338, score-0.084]

97 The minimum description length principle in coding and modeling. [sent-344, score-0.315]

98 From few to many: Illumination cone models for face recognition under variable lighting and pose. [sent-373, score-0.082]

99 A source coding approach to classification by vector quantization and the principle of minimum description length. [sent-401, score-0.252]

100 Segmentation of multivariate mixed data via lossy data coding and compression. [sent-408, score-0.318]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('micl', 0.908), ('coding', 0.174), ('lossy', 0.144), ('criterion', 0.091), ('classi', 0.08), ('bits', 0.077), ('rda', 0.075), ('distortion', 0.071), ('length', 0.063), ('mdl', 0.06), ('map', 0.06), ('er', 0.051), ('degenerate', 0.048), ('code', 0.047), ('face', 0.047), ('incremental', 0.044), ('svm', 0.04), ('xj', 0.038), ('det', 0.038), ('kernel', 0.038), ('samples', 0.038), ('minimum', 0.037), ('sparsely', 0.034), ('allowable', 0.034), ('icl', 0.032), ('lmicl', 0.032), ('nondegenerate', 0.032), ('asymptotic', 0.032), ('upto', 0.032), ('asymptotically', 0.031), ('training', 0.031), ('px', 0.03), ('dimension', 0.03), ('class', 0.029), ('py', 0.028), ('handwritten', 0.027), ('regularization', 0.027), ('pami', 0.026), ('imagery', 0.026), ('needed', 0.026), ('preprocessing', 0.026), ('regularized', 0.026), ('principle', 0.025), ('risk', 0.025), ('local', 0.025), ('distributions', 0.025), ('decision', 0.025), ('illumination', 0.024), ('nj', 0.023), ('ers', 0.023), ('discriminant', 0.023), ('ambient', 0.023), ('outperforming', 0.023), ('shannon', 0.023), ('subspace', 0.022), ('digit', 0.022), ('rn', 0.022), ('harnesses', 0.022), ('lossless', 0.022), ('losslessly', 0.022), ('submanifold', 0.022), ('surrogates', 0.022), ('outperforms', 0.021), ('rm', 0.021), ('recognition', 0.021), ('pj', 0.021), ('gaussian', 0.02), ('boundaries', 0.02), ('boundary', 0.019), ('excess', 0.019), ('embedded', 0.019), ('error', 0.019), ('lg', 0.019), ('lifted', 0.019), ('yale', 0.019), ('conventional', 0.018), ('cation', 0.018), ('minj', 0.017), ('simard', 0.017), ('tao', 0.017), ('properly', 0.017), ('label', 0.017), ('notice', 0.016), ('anisotropic', 0.016), ('description', 0.016), ('subject', 0.015), ('classes', 0.015), ('ap', 0.015), ('renders', 0.015), ('wright', 0.015), ('kernelized', 0.015), ('achieves', 0.015), ('minimizes', 0.015), ('yi', 0.015), ('covariance', 0.015), ('converges', 0.014), ('pronounced', 0.014), ('lighting', 0.014), ('lecun', 0.014), ('xi', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum

Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1

2 0.056562234 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images

Author: Pierre Garrigues, Bruno A. Olshausen

Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.

3 0.053277716 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun

Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1

4 0.046793655 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor

Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1

5 0.042453784 145 nips-2007-On Sparsity and Overcompleteness in Image Models

Author: Pietro Berkes, Richard Turner, Maneesh Sahani

Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1

6 0.04160732 118 nips-2007-Learning with Transformation Invariant Kernels

7 0.039982349 160 nips-2007-Random Features for Large-Scale Kernel Machines

8 0.039915983 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

9 0.038389802 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

10 0.03744195 8 nips-2007-A New View of Automatic Relevance Determination

11 0.036888935 109 nips-2007-Kernels on Attributed Pointsets with Applications

12 0.036631279 147 nips-2007-One-Pass Boosting

13 0.036293443 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

14 0.035674471 32 nips-2007-Bayesian Co-Training

15 0.035445891 161 nips-2007-Random Projections for Manifold Learning

16 0.035338376 112 nips-2007-Learning Monotonic Transformations for Classification

17 0.035185348 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors

18 0.035111845 166 nips-2007-Regularized Boost for Semi-Supervised Learning

19 0.03491183 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers

20 0.034882221 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.119), (1, 0.031), (2, -0.044), (3, 0.039), (4, -0.004), (5, 0.046), (6, 0.013), (7, 0.031), (8, -0.007), (9, 0.031), (10, 0.035), (11, -0.012), (12, -0.013), (13, -0.013), (14, -0.001), (15, -0.013), (16, 0.045), (17, 0.032), (18, -0.036), (19, -0.005), (20, 0.017), (21, 0.035), (22, -0.013), (23, 0.018), (24, -0.022), (25, -0.036), (26, 0.011), (27, 0.05), (28, -0.064), (29, 0.016), (30, 0.029), (31, -0.055), (32, 0.0), (33, -0.046), (34, -0.009), (35, -0.004), (36, 0.01), (37, 0.001), (38, -0.111), (39, 0.069), (40, 0.005), (41, 0.055), (42, -0.031), (43, -0.026), (44, 0.06), (45, -0.076), (46, -0.1), (47, 0.038), (48, 0.01), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88576621 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum

Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1

2 0.57110465 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers

Author: Marco Barreno, Alvaro Cardenas, J. D. Tygar

Abstract: We present a new analysis for the combination of binary classifiers. Our analysis makes use of the Neyman-Pearson lemma as a theoretical basis to analyze combinations of classifiers. We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves. 1

3 0.54518968 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis

Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

4 0.52777964 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.

5 0.51678991 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan

Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1

6 0.5077948 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

7 0.50352544 118 nips-2007-Learning with Transformation Invariant Kernels

8 0.49771449 109 nips-2007-Kernels on Attributed Pointsets with Applications

9 0.48099008 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

10 0.47856975 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images

11 0.46573392 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

12 0.45753688 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

13 0.45304275 24 nips-2007-An Analysis of Inference with the Universum

14 0.44208908 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

15 0.44023404 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning

16 0.43816483 32 nips-2007-Bayesian Co-Training

17 0.43022868 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

18 0.42479849 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

19 0.42190275 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

20 0.41898066 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.056), (13, 0.046), (16, 0.018), (18, 0.013), (19, 0.023), (21, 0.065), (29, 0.202), (31, 0.02), (34, 0.027), (35, 0.022), (47, 0.097), (49, 0.027), (83, 0.143), (85, 0.027), (87, 0.024), (90, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95025724 55 nips-2007-Computing Robust Counter-Strategies

Author: Michael Johanson, Martin Zinkevich, Michael Bowling

Abstract: Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counterstrategy to the inferred posterior of the other agents’ behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents’ expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold’em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses. 1

2 0.93261117 169 nips-2007-Retrieved context and the discovery of semantic structure

Author: Vinayak Rao, Marc Howard

Abstract: Semantic memory refers to our knowledge of facts and relationships between concepts. A successful semantic memory depends on inferring relationships between items that are not explicitly taught. Recent mathematical modeling of episodic memory argues that episodic recall relies on retrieval of a gradually-changing representation of temporal context. We show that retrieved context enables the development of a global memory space that reflects relationships between all items that have been previously learned. When newly-learned information is integrated into this structure, it is placed in some relationship to all other items, even if that relationship has not been explicitly learned. We demonstrate this effect for global semantic structures shaped topologically as a ring, and as a two-dimensional sheet. We also examined the utility of this learning algorithm for learning a more realistic semantic space by training it on a large pool of synonym pairs. Retrieved context enabled the model to “infer” relationships between synonym pairs that had not yet been presented. 1

3 0.86227739 165 nips-2007-Regret Minimization in Games with Incomplete Information

Author: Martin Zinkevich, Michael Johanson, Michael Bowling, Carmelo Piccione

Abstract: Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold’em with as many as 1012 states, two orders of magnitude larger than previous methods. 1

4 0.85433477 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh

Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1

same-paper 5 0.82091796 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum

Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1

6 0.71374589 63 nips-2007-Convex Relaxations of Latent Variable Training

7 0.71101093 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

8 0.710132 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

9 0.70603979 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

10 0.70410913 115 nips-2007-Learning the 2-D Topology of Images

11 0.70385796 86 nips-2007-Exponential Family Predictive Representations of State

12 0.70188463 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

13 0.7012037 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

14 0.70029944 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

15 0.70019513 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

16 0.69929349 7 nips-2007-A Kernel Statistical Test of Independence

17 0.69819975 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

18 0.69784629 24 nips-2007-An Analysis of Inference with the Universum

19 0.69780886 187 nips-2007-Structured Learning with Approximate Inference

20 0.69731432 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC