nips nips2005 nips2005-196 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex high-dimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available, kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two-stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights.
Reference: text
sentIndex sentText sentNum sentScore
1 School of Electronics and Computer Science, University of Southampton, Southampton, England. Abstract: Kernel methods make it relatively easy to define complex high-dimensional feature spaces. [sent-19, score-0.061]
2 This raises the question of how we can identify the relevant subspaces for a particular learning task. [sent-20, score-0.024]
3 When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). [sent-21, score-0.218]
4 This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. [sent-22, score-0.185]
5 We present both experimental and theoretical analysis of the approach showing encouraging results and insights. [sent-23, score-0.062]
6 1 Introduction Kernel methods enable us to work with high dimensional feature spaces by defining weight vectors implicitly as linear combinations of the training examples. [sent-24, score-0.199]
7 This even makes it practical to learn in infinite dimensional spaces as for example when using the Gaussian kernel. [sent-25, score-0.045]
8 The Gaussian kernel is an extreme example, but techniques have been developed to define kernels for a range of different datatypes, in many cases characterised by very high dimensionality. [sent-26, score-0.139]
9 Examples are the string kernels for text, graph kernels for graphs, marginal kernels, kernels for image data, etc. [sent-27, score-0.163]
10 With this plethora of high dimensional representations it is frequently helpful to assist learning algorithms by preprocessing the feature space, projecting the data into a low dimensional subspace that contains the relevant information for the learning task. [sent-28, score-0.273]
11 Methods of performing this include principal components analysis (PCA) [7], partial least squares [8], kernel independent component analysis (KICA) [1] and kernel canonical correlation analysis (KCCA) [5]. [sent-29, score-0.25]
12 The last method requires two views of the data both of which contain all of the relevant information for the learning task, but which individually contain representation specific details that are different and irrelevant. [sent-30, score-0.177]
13 Perhaps the simplest example of this situation is a paired document corpus in which we have the same information in two languages. [sent-31, score-0.043]
14 KCCA attempts to isolate feature space directions that correlate between the two views and hence might be expected to represent the common relevant information. [sent-32, score-0.231]
15 Hence, one can view this preprocessing as a denoising of the individual representations through cross-correlating them. [sent-33, score-0.107]
16 Experiments have shown how using this as a preprocessing step can improve subsequent analysis in for example classification experiments using a support vector machine (SVM) [6]. [sent-34, score-0.043]
17 Though the combination of KCCA and SVM seems effective, there appears to be no guarantee that the directions identified by KCCA will be best suited to the classification task. [sent-36, score-0.049]
18 This paper therefore looks at the possibility of combining the two distinct stages of KCCA and SVM into a single optimisation that will be termed SVM-2K. [sent-37, score-0.19]
19 Experiments are then given showing the performance of the algorithm on an image classification task. [sent-39, score-0.081]
20 Though the performance is encouraging it is in many ways counter-intuitive, leading to speculation about why an improvement is seen. [sent-40, score-0.054]
21 To investigate this question an analysis of its generalisation properties is given in the following two sections, before drawing conclusions. [sent-41, score-0.088]
22 2 SVM-2K Algorithm We assume that we are given two views of the same data, one expressed through a feature projection φA with corresponding kernel κA and the other through a feature projection φB with kernel κB . [sent-42, score-0.441]
23 A paired data set is then given by a set S = {(φA(x1), φB(x1)), …, (φA(xℓ), φB(xℓ))}, [sent-43, score-0.043]
24 where, for example, φA could be the feature vector associated with one language and φB that associated with a second language. [sent-46, score-0.061]
25 The KCCA algorithm looks for directions in the two feature spaces such that when the training data is projected onto those directions the two vectors (one for each view) of values obtained are maximally correlated. [sent-48, score-0.273]
26 One can also characterise these directions as those that minimise the 2-norm between the two vectors under the constraint that they both have norm 1 [5]. [sent-49, score-0.154]
27 We can think of this as constraining the choice of weight vectors in the two spaces. [sent-50, score-0.093]
28 KCCA would typically find a sequence of projection directions of dimension anywhere between 50 and 500 that can then be used as the feature space for training an SVM [6]. [sent-51, score-0.166]
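To make the two-stage baseline concrete, the sketch below uses scikit-learn's linear CCA as a stand-in for kernel CCA, with placeholder arrays X_a, X_b and labels y that are assumptions for the example: correlated directions are extracted from the paired views and an SVM is then trained on the projected features.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import SVC

# Placeholder two-view data: 200 paired examples with binary labels.
# (In the paper each view is a 400-dimensional patch histogram.)
rng = np.random.default_rng(0)
X_a = rng.normal(size=(200, 100))            # view A features
X_b = rng.normal(size=(200, 100))            # view B features
y = rng.integers(0, 2, size=200)

# Stage 1: extract correlated directions between the two views.
# Linear CCA is used here as a stand-in for kernel CCA with 50-500 directions.
cca = CCA(n_components=50)
Z_a, Z_b = cca.fit_transform(X_a, X_b)

# Stage 2: train an SVM on the projected representation
# (here the concatenation of the two projected views).
Z = np.hstack([Z_a, Z_b])
clf = SVC(kernel="linear", C=1.0).fit(Z, y)
print("training accuracy of the two-stage pipeline:", clf.score(Z, y))
```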
29 An SVM can be thought of as a 1-dimensional projection followed by thresholding, so SVM-2K combines the two steps by introducing a similarity constraint between the two 1-dimensional projections that define two distinct SVMs, one in each of the two feature spaces. [sent-52, score-0.14]
30 The extra constraint is chosen slightly differently from the 2-norm that characterises KCCA. [sent-53, score-0.044]
31 Rather than this 2-norm, we take an ε-insensitive 1-norm, using slack variables to measure the amount by which points fail to meet ε-similarity: |⟨wA, φA(xi)⟩ + bA − ⟨wB, φB(xi)⟩ − bB| ≤ ηi + ε, where wA, bA (respectively wB, bB) are the weight and threshold of the first (respectively second) SVM. [sent-54, score-0.031]
32 Let ŵA, ŵB, b̂A, b̂B be the solution to this optimisation problem. [sent-56, score-0.131]
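A minimal sketch of a joint optimisation with these ingredients is given below, written with cvxpy over linear feature maps: two linear SVMs with their own hinge slacks, coupled by the ε-insensitive constraint |⟨wA, φA(xi)⟩ + bA − ⟨wB, φB(xi)⟩ − bB| ≤ ηi + ε. The trade-off weights C and C_eta, the value of eps and the particular hinge-loss objective are illustrative assumptions rather than the paper's exact equation (1).

```python
import numpy as np
import cvxpy as cp

def svm_2k_linear(X_a, X_b, y, C=1.0, C_eta=1.0, eps=0.1):
    """Joint two-view SVM in the linear primal; y must take values in {-1, +1}."""
    n, d_a = X_a.shape
    d_b = X_b.shape[1]
    w_a, w_b = cp.Variable(d_a), cp.Variable(d_b)
    b_a, b_b = cp.Variable(), cp.Variable()
    xi_a = cp.Variable(n, nonneg=True)   # hinge slacks, view A
    xi_b = cp.Variable(n, nonneg=True)   # hinge slacks, view B
    eta = cp.Variable(n, nonneg=True)    # slacks for the coupling constraint

    f_a = X_a @ w_a + b_a                # decision values of the first SVM
    f_b = X_b @ w_b + b_b                # decision values of the second SVM

    constraints = [
        cp.multiply(y, f_a) >= 1 - xi_a,     # margin constraints, view A
        cp.multiply(y, f_b) >= 1 - xi_b,     # margin constraints, view B
        cp.abs(f_a - f_b) <= eta + eps,      # epsilon-insensitive similarity of the projections
    ]
    objective = cp.Minimize(
        0.5 * (cp.sum_squares(w_a) + cp.sum_squares(w_b))
        + C * (cp.sum(xi_a) + cp.sum(xi_b))
        + C_eta * cp.sum(eta)
    )
    cp.Problem(objective, constraints).solve()
    return w_a.value, b_a.value, w_b.value, b_b.value
```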
33 3 Experimental results. Figure 1: Typical example images from the PASCAL VOC challenge database. [sent-61, score-0.046]
34 The performance of the algorithms developed in this paper was evaluated on the PASCAL Visual Object Classes (VOC) challenge dataset test1. [sent-63, score-0.025]
35 This is a new dataset consisting of four object classes in realistic scenes. [sent-64, score-0.094]
36 The object classes are motorbikes (M), bicycles (B), people (P) and cars (C), with the dataset containing 684 training images ((214, 114, 84, 272) in each class) and 689 test images ((216, 114, 84, 275) in each class). [sent-65, score-0.425]
37 As can be seen in Figure 1 this is a very challenging dataset with objects of widely varying type, pose, illumination, occlusion, background, etc. [sent-66, score-0.025]
38 The task is to classify the image according to whether it contains a given object type. [sent-67, score-0.092]
39 That is, images of each object category (M, B, C and P) are classified against non-object images from the database. [sent-70, score-0.07]
40 The training set contained 100 positive and 100 negative images. [sent-73, score-0.023]
41 The tests are carried out on 100 new images, half belonging to the learned class and half not. [sent-74, score-0.044]
42 These methods represent an image in terms of the features of a set of small image patches. [sent-76, score-0.092]
43 By carefully choosing the patches and their features, this representation can be made largely robust to the common types of image transformation. [sent-77, score-0.072]
44 Two views were provided of each image through the use of different patch types. [sent-80, score-0.143]
45 For each image, several hundred characteristic patches were detected, the number depending on the complexity of the image. [sent-83, score-0.127]
46 These were then clustered around K = 400 centres for each feature space. [sent-84, score-0.061]
47 Each image is then represented as a histogram over these centres. [sent-85, score-0.046]
48 So finally, for each image there are two feature vectors of length 400 that provide the two views. [sent-86, score-0.146]
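The construction of the two views can be sketched as follows, assuming per-image lists of local patch descriptors (random placeholders below) and using scikit-learn's KMeans for the clustering step: all descriptors are clustered into K = 400 centres, each image becomes a normalised length-400 histogram of centre assignments, and the procedure is run once per patch type to give the two views.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_patches_histograms(descriptor_lists, k=400, seed=0):
    """Cluster all local descriptors into k centres and histogram each image over them."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(np.vstack(descriptor_lists))
    histograms = []
    for descriptors in descriptor_lists:
        assignments = km.predict(descriptors)
        hist = np.bincount(assignments, minlength=k).astype(float)
        histograms.append(hist / hist.sum())   # normalise for varying patch counts
    return np.vstack(histograms)

# Placeholder descriptors: a few hundred 128-dimensional patches per image.
rng = np.random.default_rng(0)
patches_view_a = [rng.normal(size=(int(n), 128)) for n in rng.integers(200, 400, size=20)]
X_a = bag_of_patches_histograms(patches_view_a, k=400)   # one 400-dimensional view per image
# Repeating the procedure with the second patch type gives the second view X_b.
```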
49 Table 1: Results for 4 datasets showing test accuracy of the individual SVMs and SVM-2K. [sent-103, score-0.069]
50 Table 1 shows the test errors obtained for the different categories for the individual SVMs and the SVM-2K. [sent-104, score-0.058]
51 There is a clear improvement in performance of the SVM-2K over the two individual SVMs in all four categories. [sent-105, score-0.061]
52 Intuitively it would appear better to take advantage of the abilities of the different representations to better fit the data. [sent-107, score-0.03]
53 In order to understand this apparent contradiction we now consider a theoretical analysis of the generalisation of the SVM-2K using the framework provided by Rademacher complexity bounds. [sent-108, score-0.189]
54 For a sample S = {x1, …, xℓ} from X and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable R̂ℓ(F) = Eσ[ sup_{f∈F} (2/ℓ) Σ_{i=1}^{ℓ} σi f(xi) | x1, …, xℓ ], where σ = {σ1, …, σℓ} are independent uniform {±1}-valued Rademacher random variables. [sent-115, score-0.23]
55 The Rademacher complexity of F is Rℓ(F) = ES[R̂ℓ(F)] = ESσ[ sup_{f∈F} (2/ℓ) Σ_{i=1}^{ℓ} σi f(xi) ]. We use ED to denote expectation with respect to a distribution D and ES when the distribution is the uniform (empirical) distribution on a sample S. [sent-116, score-0.158]
56 Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [0, 1]. [sent-118, score-0.044]
57 The following result bounds the Rademacher complexity of linear function classes. [sent-121, score-0.127]
58 [2] If κ : X × X → R is a kernel and S = {x1, …, xℓ} is a sample of points from X, then the empirical Rademacher complexity of the class FB satisfies R̂ℓ(FB) ≤ (2B/ℓ) √( Σ_{i=1}^{ℓ} κ(xi, xi) ) = (2B/ℓ) √(tr(K)). [sent-123, score-0.173]
59 4.1 Analysing SVM-2K. For SVM-2K, the two feature sets derived from the same objects are (φA(xi))_{i=1}^{ℓ} and (φB(xi))_{i=1}^{ℓ} respectively. [sent-124, score-0.236]
60 We assume the notation and optimisation of SVM-2K given in section 2, equation (1). [sent-125, score-0.131]
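Both quantities in the theorem above are straightforward to compute: the trace bound directly from a kernel matrix, and the empirical Rademacher complexity itself by Monte Carlo over σ, since for the norm-bounded linear class the supremum inside the expectation has the closed form (2B/ℓ)√(σ′Kσ). The sketch below uses a placeholder kernel matrix K, standing in for KA or KB, and a norm bound B.

```python
import numpy as np

def rademacher_trace_bound(K, B):
    """The bound (2B/l) * sqrt(tr(K)) from the theorem above."""
    l = K.shape[0]
    return (2.0 * B / l) * np.sqrt(np.trace(K))

def rademacher_monte_carlo(K, B, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[(2B/l) * sqrt(sigma' K sigma)],
    the empirical Rademacher complexity of the ball of functions with norm at most B."""
    rng = np.random.default_rng(seed)
    l = K.shape[0]
    estimates = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)
        estimates.append((2.0 * B / l) * np.sqrt(max(sigma @ K @ sigma, 0.0)))
    return float(np.mean(estimates))

# Placeholder linear kernel matrix standing in for K_A or K_B.
X = np.random.default_rng(0).normal(size=(100, 20))
K = X @ X.T
print("trace bound:         ", rademacher_trace_bound(K, B=1.0))
print("Monte Carlo estimate:", rademacher_monte_carlo(K, B=1.0))
```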
61 First observe that an application of Theorem 1 shows that ES[|fA(x) − fB(x)|] ≤ ES[|⟨ŵA, φA(x)⟩ + b̂A − ⟨ŵB, φB(x)⟩ − b̂B|] ≤ ε + (1/ℓ) Σ_{i=1}^{ℓ} ηi + (2C/ℓ)( √(tr(KA)) + √(tr(KB)) ) + 3 √( ln(2/δ) / (2ℓ) ) =: D, with probability at least 1 − δ. [sent-126, score-0.023]
62 Hence, the class of functions we are considering when applying SVM-2K to this problem can be restricted to FC,D = { f : x ↦ 0.5 ( Σ_{i=1}^{ℓ} [ giA κA(xi, x) + giB κB(xi, x) ] + bA + bB ) : gA′ KA gA + bA² ≤ C², gB′ KB gB + bB² ≤ C², ES[|fA(x) − fB(x)|] ≤ D }. [sent-128, score-0.264]
63 The class FC,D is clearly closed under negation. [sent-129, score-0.044]
64 Applying the usual Rademacher techniques for margin bounds on generalisation we obtain the following result. [sent-130, score-0.143]
65 Fix δ ∈ (0, 1) and let FC,D be the class of functions described above. [sent-132, score-0.044]
66 Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ FC,D satisfies a bound on the generalisation error P_{(x,y)∼D}(sign(f(x)) ≠ y) in terms of the empirical margin error, the empirical Rademacher complexity R̂ℓ(FC,D), and a confidence term 3√( ln(2/δ) / (2ℓ) ). [sent-134, score-0.059]
67 It therefore remains to compute the empirical Rademacher complexity of FC,D, which is the critical discriminator between the bounds for the individual SVMs and that of the SVM-2K. [sent-136, score-0.189]
68 4.2 Empirical Rademacher complexity of FC,D. We now define an auxiliary function of two weight vectors wA and wB, D(wA, wB) := ED[ |⟨wA, φA(x)⟩ + bA − ⟨wB, φB(x)⟩ − bB| ]. With this notation we can consider computing the Rademacher complexity of the class FC,D. [sent-138, score-0.316]
69 Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [0, 1]. [sent-140, score-0.044]
70 Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies ES[f(x)] ≤ ED[f(x)] + Rℓ(F) + 3√( ln(2/δ) / (2ℓ) ) ≤ ED[f(x)] + R̂ℓ(F) + 3√( ln(2/δ) / (2ℓ) ). The proof tracks that of Theorem 1 but is omitted through lack of space. [sent-142, score-0.059]
71 The above result shows that, with probability greater than 1 − δ, R̂ℓ(FC,D) ≤ Eσ[ sup { (1/ℓ) Σ_{i=1}^{ℓ} σi [ ⟨wA, φA(xi)⟩ + bA + ⟨wB, φB(xi)⟩ + bB ] : ‖wA‖ ≤ C, ‖wB‖ ≤ C, D̂(wA, wB) ≤ D } ]. First note that the expression in square brackets is concentrated under the uniform distribution of the Rademacher variables. [sent-144, score-0.057]
72 Hence, we can estimate the complexity for a fixed instantiation of the Rademacher variables σ. [sent-145, score-0.101]
73 This is the optimisation solved in the brief experiments described below. [sent-148, score-0.131]
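In the same spirit, this quantity can be approximated by drawing a few instantiations of σ and solving the constrained maximisation for each. The cvxpy sketch below works in the linear primal, imposes the coupling constraint empirically on the training sample, and uses placeholder values of C and D, so it illustrates the computation rather than reproducing the exact procedure used for Table 2.

```python
import numpy as np
import cvxpy as cp

def rademacher_fcd_one_draw(X_a, X_b, sigma, C, D):
    """sup over FC,D of (1/l) * sum_i sigma_i [f_A(x_i) + f_B(x_i)] for one draw of sigma."""
    l, d_a = X_a.shape
    d_b = X_b.shape[1]
    w_a, w_b = cp.Variable(d_a), cp.Variable(d_b)
    b_a, b_b = cp.Variable(), cp.Variable()
    f_a = X_a @ w_a + b_a
    f_b = X_b @ w_b + b_b
    constraints = [
        cp.sum_squares(w_a) + b_a ** 2 <= C ** 2,   # ||w_A||^2 + b_A^2 <= C^2
        cp.sum_squares(w_b) + b_b ** 2 <= C ** 2,   # ||w_B||^2 + b_B^2 <= C^2
        cp.sum(cp.abs(f_a - f_b)) / l <= D,         # empirical coupling constraint
    ]
    problem = cp.Problem(cp.Maximize((1.0 / l) * (sigma @ (f_a + f_b))), constraints)
    problem.solve()
    return problem.value

# Average a handful of draws to approximate the expectation over sigma.
rng = np.random.default_rng(0)
X_a, X_b = rng.normal(size=(100, 30)), rng.normal(size=(100, 30))
draws = [rademacher_fcd_one_draw(X_a, X_b, rng.choice([-1.0, 1.0], size=100), C=1.0, D=0.2)
         for _ in range(5)]
print("estimated Rademacher complexity of FC,D:", float(np.mean(draws)))
```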
74 4.3 Experiments with Rademacher complexity. We computed the Rademacher complexity for the problems considered in the experimental section above. [sent-150, score-0.202]
75 We wished to verify that the Rademacher complexity of the space FC,D, where C and D are determined by applying the SVM-2K, is indeed significantly lower than that obtained for the SVMs in each space individually. [sent-151, score-0.149]
76 Table 2: Results for 4 datasets showing test accuracy and Rademacher complexity (Rad) of the individual SVMs and SVM-2K. [sent-176, score-0.17]
77 Table 2 shows the results for the motorbike, bicycle, people and car datasets. [sent-177, score-0.1]
78 We show the Rademacher complexities for the individual SVMs and for the SVM-2K along with the generalisation results already given in Table 1. [sent-178, score-0.154]
79 As predicted the Rademacher complexity is significantly smaller for SVM-2K, hence confirming the intuition that led to the introduction of the approach, namely that the complexity of the class is reduced by restricting the weight vectors to align on the training data. [sent-180, score-0.339]
80 Provided both representations contain the necessary data we can therefore expect an improvement in generalisation as observed in the reported experiments. [sent-181, score-0.173]
81 5 Conclusions With the plethora of data now being collected in a wide range of fields there is frequently the luxury of having two views of the same phenomenon. [sent-182, score-0.168]
82 The simplest example is paired corpora of documents in different languages, but equally we can think of examples from bioinformatics, machine vision, etc. [sent-183, score-0.066]
83 Frequently it is also reasonable to assume that both views contain all of the relevant information required for a classification task. [sent-184, score-0.149]
84 We have demonstrated that in such cases it can be possible to leverage the correlation between the two views to improve classification accuracy. [sent-185, score-0.125]
85 This has been demonstrated in experiments with a machine vision task. [sent-186, score-0.027]
86 Furthermore, we have undertaken a theoretical analysis to illuminate the source and extent of the advantage that can be obtained, showing in the cases considered a significant reduction in the Rademacher complexity of the corresponding function classes. [sent-187, score-0.136]
87 Rademacher and Gaussian complexities: risk bounds and structural results. [sent-197, score-0.026]
88 Canonical correlation analysis: An overview with application to learning methods. [sent-214, score-0.028]
89 Using KCCA for Japanese-English cross-language information retrieval and classification. [sent-217, score-0.279]
90 Kernel PCA and de-noising in feature spaces. [sent-227, score-0.061]
91 Kernel partial least squares regression in reproducing kernel Hilbert space. [sent-233, score-0.101]
wordName wordTfidf (topN-words)
[('wb', 0.442), ('rademacher', 0.442), ('wa', 0.334), ('kcca', 0.279), ('bb', 0.241), ('ba', 0.193), ('svm', 0.147), ('ka', 0.141), ('optimisation', 0.131), ('kb', 0.131), ('xi', 0.101), ('complexity', 0.101), ('gi', 0.097), ('views', 0.097), ('generalisation', 0.088), ('fb', 0.088), ('svms', 0.087), ('rad', 0.08), ('kernel', 0.078), ('abs', 0.076), ('bicycle', 0.076), ('ln', 0.076), ('tr', 0.074), ('motorbike', 0.066), ('feature', 0.061), ('sup', 0.057), ('hardoon', 0.051), ('hongying', 0.051), ('sandor', 0.051), ('szedmak', 0.051), ('voc', 0.051), ('fa', 0.05), ('car', 0.05), ('people', 0.05), ('es', 0.049), ('directions', 0.049), ('image', 0.046), ('object', 0.046), ('images', 0.046), ('southampton', 0.044), ('motorbikes', 0.044), ('plethora', 0.044), ('class', 0.044), ('preprocessing', 0.043), ('paired', 0.043), ('canonical', 0.043), ('fix', 0.042), ('theorem', 0.041), ('kernels', 0.039), ('vectors', 0.039), ('draws', 0.036), ('showing', 0.035), ('individual', 0.034), ('projection', 0.033), ('cars', 0.032), ('complexities', 0.032), ('weight', 0.031), ('termed', 0.03), ('representations', 0.03), ('usual', 0.029), ('classi', 0.029), ('looks', 0.029), ('correlation', 0.028), ('empirical', 0.028), ('contain', 0.028), ('encouraging', 0.027), ('vision', 0.027), ('frequently', 0.027), ('improvement', 0.027), ('applying', 0.026), ('bounds', 0.026), ('bartlett', 0.026), ('patches', 0.026), ('dataset', 0.025), ('categories', 0.024), ('relevant', 0.024), ('combines', 0.024), ('classes', 0.023), ('think', 0.023), ('satis', 0.023), ('training', 0.023), ('spaces', 0.023), ('john', 0.023), ('least', 0.023), ('dimensional', 0.022), ('constraint', 0.022), ('characterises', 0.022), ('characterised', 0.022), ('csurka', 0.022), ('dance', 0.022), ('analysing', 0.022), ('languages', 0.022), ('meng', 0.022), ('rosipal', 0.022), ('characterise', 0.022), ('minimise', 0.022), ('rming', 0.022), ('wished', 0.022), ('jst', 0.022), ('pascal', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
2 0.23429158 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
3 0.12902531 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
4 0.10643244 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.10610275 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks
Author: Viren Jain, Valentin Zhigulin, H. S. Seung
Abstract: There is little consensus about the computational function of top-down synaptic connections in the visual system. Here we explore the hypothesis that top-down connections, like bottom-up connections, reflect partwhole relationships. We analyze a recurrent network with bidirectional synaptic interactions between a layer of neurons representing parts and a layer of neurons representing wholes. Within each layer, there is lateral inhibition. When the network detects a whole, it can rigorously enforce part-whole relationships by ignoring parts that do not belong. The network can complete the whole by filling in missing parts. The network can refuse to recognize a whole, if the activated parts do not conform to a stored part-whole relationship. Parameter regimes in which these behaviors happen are identified using the theory of permitted and forbidden sets [3, 4]. The network behaviors are illustrated by recreating Rumelhart and McClelland’s “interactive activation” model [7].
6 0.08261548 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.079861775 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
8 0.070923492 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
9 0.066789046 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
10 0.063190706 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
11 0.062959217 62 nips-2005-Efficient Estimation of OOMs
12 0.062214892 77 nips-2005-From Lasso regression to Feature vector machine
13 0.060817 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
14 0.060209349 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
15 0.058672413 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
16 0.058363743 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
17 0.05639324 105 nips-2005-Large-Scale Multiclass Transduction
18 0.055157214 138 nips-2005-Non-Local Manifold Parzen Windows
19 0.053650815 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
20 0.053170934 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
topicId topicWeight
[(0, 0.185), (1, 0.082), (2, -0.066), (3, -0.012), (4, 0.065), (5, 0.136), (6, 0.109), (7, 0.072), (8, 0.017), (9, -0.096), (10, 0.018), (11, -0.027), (12, 0.042), (13, 0.012), (14, -0.047), (15, 0.204), (16, 0.095), (17, -0.092), (18, 0.109), (19, 0.007), (20, 0.12), (21, -0.076), (22, 0.097), (23, -0.035), (24, 0.03), (25, 0.002), (26, 0.069), (27, 0.039), (28, 0.127), (29, 0.107), (30, -0.073), (31, 0.061), (32, -0.149), (33, 0.051), (34, 0.027), (35, 0.127), (36, 0.046), (37, 0.104), (38, 0.224), (39, -0.097), (40, -0.099), (41, -0.094), (42, -0.196), (43, 0.046), (44, -0.056), (45, 0.105), (46, -0.045), (47, 0.044), (48, -0.023), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.92873818 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
2 0.72083455 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
3 0.48990187 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
4 0.47142038 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
5 0.44947556 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
6 0.39955738 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
7 0.38686615 112 nips-2005-Learning Minimum Volume Sets
8 0.38090113 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks
9 0.36085251 160 nips-2005-Query by Committee Made Real
10 0.35809779 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
11 0.34048158 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
12 0.3262597 105 nips-2005-Large-Scale Multiclass Transduction
13 0.31685254 189 nips-2005-Tensor Subspace Analysis
14 0.3122974 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
15 0.30832154 195 nips-2005-Transfer learning for text classification
16 0.29829791 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
17 0.28460872 77 nips-2005-From Lasso regression to Feature vector machine
18 0.27792871 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
19 0.27531189 103 nips-2005-Kernels for gene regulatory regions
20 0.27281958 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects
topicId topicWeight
[(3, 0.124), (10, 0.048), (27, 0.032), (31, 0.018), (34, 0.135), (39, 0.027), (41, 0.011), (48, 0.25), (50, 0.01), (55, 0.019), (67, 0.015), (69, 0.064), (73, 0.033), (88, 0.075), (91, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.84202111 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
2 0.64138395 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
3 0.63171279 112 nips-2005-Learning Minimum Volume Sets
Author: Clayton Scott, Robert Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1
4 0.63170606 58 nips-2005-Divergences, surrogate loss functions and experimental design
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1
5 0.62956238 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
Author: Zhenyue Zhang, Hongyuan Zha
Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods. 1
6 0.62688875 160 nips-2005-Query by Committee Made Real
7 0.62620431 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
8 0.62282747 50 nips-2005-Convex Neural Networks
9 0.62061965 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
10 0.62010926 47 nips-2005-Consistency of one-class SVM and related algorithms
11 0.61854023 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
12 0.61590606 114 nips-2005-Learning Rankings via Convex Hull Separation
13 0.61427605 184 nips-2005-Structured Prediction via the Extragradient Method
14 0.61307967 41 nips-2005-Coarse sample complexity bounds for active learning
15 0.61015934 151 nips-2005-Pattern Recognition from One Example by Chopping
16 0.61011821 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
17 0.60655415 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
18 0.60330403 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.6018557 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
20 0.60073465 98 nips-2005-Infinite latent feature models and the Indian buffet process