nips nips2004 nips2004-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester
Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. [sent-11, score-0.34]
2 Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. [sent-12, score-0.097]
3 We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. [sent-13, score-0.387]
4 The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. [sent-14, score-0.287]
5 The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. [sent-15, score-0.974]
6 For example, x might be a word string and y a sequence of part-of-speech labels, or x might be a Markov random field and y a labeling of x, or x might be a word string and y a parse of x. [sent-17, score-0.465]
7 In these examples the number of possible labels y is exponential in the size of x. [sent-18, score-0.148]
8 This paper presents a training algorithm for a general definition of structured classification covering both Markov random fields and parsing. [sent-19, score-0.335]
9 We assume that pairs x, y can be embedded in a linear feature space Φ(x, y), and that a predictive rule is determined by a direction (weight vector) w in that feature space. [sent-21, score-0.156]
10 However, the case of structured labels has only recently been considered [2, 12, 3, 13]. [sent-24, score-0.339]
11 The structured-label case takes into account the internal structure of y in the assignment of feature vectors, the computation of loss, and the definition and use of margins. [sent-25, score-0.176]
12 Moreover, we assume that the feature vector for y and the loss for y are both linear in the individual bits of y. [sent-27, score-0.269]
13 The starting-point for these methods is a primal problem that has one constraint for each possible labeling y; or equivalently a dual problem where each y has an associated dual variable. [sent-30, score-0.655]
14 We give a new training algorithm that relies on an exponential-family (Gibbs distribution) representation of structured objects. [sent-31, score-0.443]
15 The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. [sent-32, score-0.287]
16 The optimization method for structured labels relies on a more general result, specifically the application of exponentiated gradient (EG) updates [7, 8] to quadratic programs (QPs). [sent-34, score-1.082]
17 The algorithm uses multiplicative updates on dual parameters in the problem. [sent-36, score-0.513]
18 In addition to their application to the structured-labels task, the EG updates lead to simple algorithms for optimizing “conventional” binary or multiclass SVM problems. [sent-37, score-0.229]
19 [5] describe exponentiated gradient algorithms for SVMs, but for binary classification in the “hard-margin” case, without slack variables. [sent-41, score-0.353]
20 We show that the EG-QP algorithm converges significantly faster than the rates shown in [5]. [sent-42, score-0.097]
21 Multiplicative updates for SVMs are also described in [11], but unlike our method, the updates in [11] do not appear to factor in a way that allows algorithms for MRFs and WCFGs based on Gibbsdistribution representations. [sent-43, score-0.35]
22 CRFs define a linear model for structured problems, in a similar way to the models in our work, and also rely on the efficient computation of marginals in the training phase. [sent-45, score-0.364]
23 The function L(x, y, ŷ) measures the loss when y is the true label for x and ŷ is a predicted label; typically, ŷ is the label proposed by some function f(x). [sent-49, score-0.213]
24 Given a parameter vector w ∈ R^d, we consider functions of the form f_w(x) = arg max_{y ∈ G(x)} ⟨Φ(x, y), w⟩. [sent-56, score-0.182]
25 Given n independent training examples (x_i, y_i) with the same distribution as (X, Y), we will formalize a large-margin optimization problem that is a generalization of support vector methods for binary classifiers, and is essentially the same as the formulation in [12]. [sent-57, score-0.24]
26 The optimal parameters are taken to minimize the following regularized empirical risk function: (1/2)‖w‖² + C Σ_i max_y ( L(x_i, y_i, y) − m_{i,y}(w) )_+ , where m_{i,y}(w) = ⟨w, φ(x_i, y_i)⟩ − ⟨w, φ(x_i, y)⟩ is the “margin” on (i, y) and (z)_+ = max{z, 0}. [sent-58, score-0.254]
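Written out in display form, the objective just described reads as follows (a restatement assembled from the definitions above, not a verbatim copy of Figure 1):

    \min_{w \in \mathbb{R}^d}\;\; \frac{1}{2}\|w\|^2 \;+\; C\sum_{i=1}^{n}\,\max_{y \in G(x_i)}\bigl(L(x_i,y_i,y) - m_{i,y}(w)\bigr)_{+},
    \qquad m_{i,y}(w) = \langle w, \phi(x_i,y_i)\rangle - \langle w, \phi(x_i,y)\rangle .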
27 This optimization can be expressed as the primal problem in Figure 1. [sent-59, score-0.199]
28 Following [12], the dual of this problem is also shown in Figure 1. [sent-60, score-0.224]
29 We use the definitions Li,y = L(xi , yi , y), and Φi,y = Φ(xi , yi ) − Φ(xi , y). [sent-63, score-0.186]
30 The constant C dictates the relative penalty for values of the slack variables that are greater than 0. [sent-65, score-0.121]
31 program F(ᾱ) in the dual variables α_{i,y} for all i = 1, . . . , n. [sent-66, score-0.311]
32 The dual variables α_{i,y} for each example i are constrained to form a probability distribution over Y. [sent-70, score-0.311]
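For concreteness, a display-form sketch of the dual and its simplex constraints, assembled from the definitions of L_{i,y} and Φ_{i,y} above and from the expansion of F(ᾱ) quoted later in the proof sketch; treat it as a reconstruction rather than a verbatim copy of Figure 1:

    \max_{\bar\alpha}\;\; F(\bar\alpha) \;=\; C\sum_{i,y}\alpha_{i,y} L_{i,y} \;-\; \frac{C^2}{2}\,\Bigl\|\sum_{i}\Bigl(\Phi(x_i,y_i) - \sum_{y}\alpha_{i,y}\,\Phi(x_i,y)\Bigr)\Bigr\|^2
    \quad \text{subject to}\quad \sum_{y\in G(x_i)}\alpha_{i,y}=1,\;\; \alpha_{i,y}\ge 0 \;\;\text{for all } i, y .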
33 2.1 Models for structured classification. The problems we are interested in concern structured labels, which have a natural decomposition into “parts”. [sent-72, score-0.534]
34 Formally, we assume some countable set of parts, R. [sent-73, score-0.094]
35 Thus R(x, y) is the set of parts belonging to a particular object. [sent-75, score-0.107]
36 For convenience we define indicator variables I(x, y, r) which are 1 if r ∈ R(x, y), 0 otherwise. [sent-78, score-0.087]
37 Example 1: Markov Random Fields (MRFs) In an MRF the space of labels G(x), and their underlying structure, can be represented by a graph. [sent-83, score-0.093]
38 Each clique in the graph has a set of possible configurations: for example, if a particular clique contains vertices {v3 , v5 , v6 }, the set of possible configurations of this clique is Y3 × Y5 × Y6 . [sent-93, score-0.453]
39 The feature vector representation φ(x, c, a) for each part can essentially track any characteristics of the assignment a for clique c, together with any features of the input x. [sent-96, score-0.415]
40 A number of choices for the loss function l(x, y, (c, a)) are possible. [sent-97, score-0.111]
41 For example, consider the Hamming loss used in [12], defined as L(x, y, ŷ) = Σ_i I[y_i ≠ ŷ_i]. [sent-98, score-0.111]
42 To achieve this, first assign each vertex v_i to a single one of the cliques in which it appears. [sent-99, score-0.153]
43 Second, define l(x, y, (c, a)) to be the number of labels in the assignment (c, a) which are both incorrect and correspond to vertices which have been assigned to the clique c (note that assigning each vertex to a single clique avoids “double counting” of label errors). [sent-100, score-0.578]
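A minimal sketch of this per-clique loss for a chain-structured MRF, where vertex 0 is owned by the first edge-clique together with vertex 1 and every later vertex t is owned by the clique (t-1, t); the function names and data layout are illustrative, not from the paper:

    # Toy chain MRF: vertices 0..T-1, cliques are consecutive pairs (t-1, t).
    def owned_vertices(clique_index):
        # Vertex 0 and 1 belong to the first clique; vertex t to clique (t-1, t).
        return {0, 1} if clique_index == 0 else {clique_index + 1}

    def part_loss(y_true, clique_vertices, assignment, owned):
        # l(x, y, (c, a)): count labels in (c, a) that are wrong AND owned by c,
        # so summing over cliques gives the Hamming loss with no double counting.
        return sum(1 for v, lab in zip(clique_vertices, assignment)
                   if v in owned and lab != y_true[v])

    # Example: for a full labelling y_hat,
    #   sum(part_loss(y_true, (t-1, t), (y_hat[t-1], y_hat[t]), owned_vertices(t-1))
    #       for t in range(1, len(y_true)))
    # equals the Hamming distance between y_hat and y_true.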
44 Definitions: α_{i,y}(θ̄) = exp( Σ_{r ∈ R(x_i, y)} θ_{i,r} ) / Σ_{y' ∈ G(x_i)} exp( Σ_{r ∈ R(x_i, y')} θ_{i,r} ). Algorithm: • Choose initial values θ̄^1 for the θ_{i,r} variables (these values can be arbitrary). [sent-108, score-0.087]
45 – Set w_t = C ( Σ_{i, r ∈ R(x_i, y_i)} φ_{i,r} − Σ_{i, r ∈ R(x_i)} µ^t_{i,r} φ_{i,r} ). [sent-116, score-0.21]
46 – For i = 1, . . . , n and r ∈ R(x_i), calculate the updates θ^{t+1}_{i,r} = θ^t_{i,r} + ηC ( l_{i,r} + ⟨w_t, φ_{i,r}⟩ ). Output: parameter values w_{T+1}. Figure 2: The EG algorithm for structured problems. [sent-119, score-0.734]
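A schematic rendering of Figure 2 in code. It assumes the caller supplies, for each example i, part feature vectors phi[i][r], part losses loss[i][r], the gold parts R(x_i, y_i), and a routine marginals(i, theta_i) that returns µ_{i,r} under the current Gibbs distribution (e.g. forward-backward for chain MRFs, inside-outside for WCFGs); the names and data layout are assumptions for illustration only.

    import numpy as np

    def eg_structured(phi, loss, gold_parts, marginals, C, eta, T, d):
        """Sketch of the EG algorithm for structured problems (Figure 2).
        phi[i][r]    : feature vector (length d) of part r of example i
        loss[i][r]   : part loss l(x_i, y_i, r)
        gold_parts[i]: parts r in R(x_i, y_i)
        marginals(i, theta_i): dict r -> mu_{i,r} under the Gibbs distribution
        """
        n = len(phi)
        theta = [{r: 0.0 for r in phi[i]} for i in range(n)]  # arbitrary initial values

        def compute_w():
            mu = [marginals(i, theta[i]) for i in range(n)]
            w = np.zeros(d)
            for i in range(n):
                for r in gold_parts[i]:
                    w += phi[i][r]
                for r in phi[i]:
                    w -= mu[i][r] * phi[i][r]
            return C * w

        for t in range(T):
            w = compute_w()
            for i in range(n):
                for r in phi[i]:
                    # theta^{t+1}_{i,r} = theta^t_{i,r} + eta*C*(l_{i,r} + <w_t, phi_{i,r}>)
                    theta[i][r] += eta * C * (loss[i][r] + w @ phi[i][r])
        return compute_w()  # parameter values w_{T+1}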
47 We use φi,r = φ(xi , r) and li,r = l(xi , yi , r). [sent-120, score-0.093]
48 For convenience, we restrict the grammar to be in Chomsky normal form, where all rules in the grammar are of the form A → B C or A → a, where A, B, C are non-terminal symbols, and a is some terminal symbol. [sent-122, score-0.12]
49 The function R(x, y) maps a derivation y to the set of parts which it includes. [sent-134, score-0.143]
50 In WCFGs φ(x, r) can be any function mapping a rule production and its position in the sentence x, to a feature vector. [sent-135, score-0.112]
51 One example of a loss function would be to define l(x, y, r) to be 1 only if r’s non-terminal A is not seen spanning words s . [sent-136, score-0.158]
52 3 EG updates for structured objects. We now consider an algorithm for computing ᾱ* = arg max_{ᾱ ∈ ∆} F(ᾱ), where F(ᾱ) is the dual form of the maximum margin problem, as in Figure 1. [sent-141, score-0.79]
53 In particular, we are interested in the optimal values of the primal form parameters, which are related to ᾱ* by w* = C Σ_{i,y} α*_{i,y} Φ_{i,y}. [sent-142, score-0.152]
54 A key problem is that in many of our examples, the number of dual variables αi,y precludes dealing with these variables directly. [sent-143, score-0.398]
55 For example, in the MRF and WCFG cases, the set G(x) is exponential in size, and the number of dual variables α_{i,y} is therefore also exponential. [sent-144, score-0.366]
56 We describe an algorithm that is efficient for certain examples of structured objects such as MRFs or WCFGs. [sent-145, score-0.298]
57 Instead of representing the α_{i,y} variables explicitly, we will instead manipulate a vector θ̄ of variables θ_{i,r} for i = 1, . . . , n and r ∈ R(x_i). [sent-146, score-0.237]
58 Thus we have one of these “mini-dual” variables for each part seen in the training data. [sent-150, score-0.164]
59 We now define the dual variables α_{i,y} as a function of the vector θ̄, which takes the form of a Gibbs distribution: α_{i,y}(θ̄) = exp( Σ_{r ∈ R(x_i, y)} θ_{i,r} ) / Σ_{y' ∈ G(x_i)} exp( Σ_{r ∈ R(x_i, y')} θ_{i,r} ). [sent-152, score-0.374]
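When G(x_i) is small enough to enumerate, this Gibbs form can be checked by brute force; a tiny sketch with hypothetical names:

    import math

    def gibbs_alpha(theta_i, parts_of):
        """alpha_{i,y}(theta) = exp(sum_{r in R(x_i,y)} theta_{i,r}) / Z_i,
        computed by enumerating the labels y in G(x_i).
        theta_i  : dict r -> theta_{i,r}
        parts_of : dict y -> iterable of parts R(x_i, y)
        """
        scores = {y: math.exp(sum(theta_i[r] for r in rs)) for y, rs in parts_of.items()}
        Z = sum(scores.values())
        return {y: s / Z for y, s in scores.items()}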
60 The algorithm defines a sequence of values θ̄^1, θ̄^2, . . . [sent-154, score-0.098]
61 The algorithm can be implemented efficiently, independently of the dimensionality of ᾱ, provided that there is an efficient algorithm for computing the marginal terms µ_{i,r} = Σ_y α_{i,y}(θ̄) I(x_i, y, r) for all i = 1, . . . , n and r ∈ R(x_i). [sent-162, score-0.15]
62 For example, in the WCFG case, the inside-outside algorithm can be used, provided that each part r is a context-free rule production, as described in Example 2 above. [sent-168, score-0.092]
63 Note that the main storage requirements of the algorithm in Figure 2 concern the vector θ̄. [sent-170, score-0.157]
64 This is a vector which has as many components as there are parts in the training set. [sent-171, score-0.207]
65 In practice, the number of parts in the training data can become extremely large. [sent-172, score-0.144]
66 Rather than explicitly storing the θ_{i,r} variables, we can store a vector z_t of the same dimensionality as w_t. [sent-174, score-0.362]
67 In the next section we show that the original algorithm converges for any choice of initial values θ̄^1, so this restriction on θ^1_{i,r} should not be significant. [sent-184, score-0.097]
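To see why a single vector can suffice, unroll the update in Figure 2 (a sketch, assuming for simplicity that the initial values θ^1_{i,r} are of the restricted form just mentioned, e.g. zero):

    \theta^{t}_{i,r} \;=\; \theta^{1}_{i,r} + \eta C \sum_{s=1}^{t-1}\bigl(l_{i,r} + \langle w_s, \phi_{i,r}\rangle\bigr)
    \;=\; \theta^{1}_{i,r} + (t-1)\,\eta C\, l_{i,r} + \eta C\,\Bigl\langle \sum_{s=1}^{t-1} w_s,\; \phi_{i,r}\Bigr\rangle ,

so the whole collection of θ_{i,r} values is determined by the iteration count, the constant ηC, the per-part losses l_{i,r} (available from the data), and one accumulated vector such as z_t = Σ_{s<t} w_s of the same dimensionality as w_t; the exact bookkeeping in the paper may differ in detail.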
68 4 Exponentiated gradient (EG) updates for quadratic programs. We now prove convergence properties of the algorithm in Figure 2. [sent-185, score-0.498]
69 We show that it is an instantiation of a general algorithm for optimizing quadratic programs (QPs), which relies on Exponentiated Gradient (EG) updates [7, 8]. [sent-186, score-0.429]
70 In the general problem we assume a positive semi-definite matrix A ∈ R^{m×m} and a vector b ∈ R^m, specifying a loss function Q(ᾱ) = ⟨b, ᾱ⟩ + (1/2) ᾱᵀAᾱ. [sent-187, score-0.208]
71 In the next section we give a proof of its convergence properties. [sent-199, score-0.102]
72 The EG-QP algorithm can be used to find the minimum of −F(ᾱ), and hence the maximum of the dual objective F(ᾱ). [sent-200, score-0.276]
73 Consider the sequence ᾱ(θ̄^1), ᾱ(θ̄^2), . . . produced by the algorithm in Figure 2, and the sequence ᾱ^1, ᾱ^2, . . . produced by the EG-QP algorithm. [sent-206, score-0.092]
74 Inputs: A positive semi-definite matrix A, and a vector b, specifying a loss function Q(ᾱ) = ⟨b, ᾱ⟩ + (1/2) ᾱᵀAᾱ. [sent-214, score-0.174]
75 Figure 3: The EG-QP algorithm for quadratic programs. [sent-224, score-0.127]
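A sketch of the EG-QP step, assuming the variables are grouped into blocks that each live on a probability simplex (as in the structured dual) and using the standard multiplicative form of the exponentiated gradient update; this is an illustration consistent with the description above, not a verbatim copy of Figure 3.

    import numpy as np

    def eg_qp(A, b, blocks, eta, T):
        """Minimize Q(a) = b.a + 0.5 * a.A.a over block-wise probability simplices
        with exponentiated gradient updates.
        blocks : list of index arrays; each block of a must sum to 1."""
        a = np.zeros(len(b))
        for idx in blocks:                      # uniform initial point
            a[idx] = 1.0 / len(idx)
        for _ in range(T):
            grad = b + A @ a                    # gradient of Q at a
            for idx in blocks:
                w = a[idx] * np.exp(-eta * grad[idx])   # multiplicative EG step
                a[idx] = w / w.sum()            # renormalize onto the simplex
        return a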
76 We can write F(ᾱ) = C Σ_{i,y} α_{i,y} L_{i,y} − (C²/2) ‖ Σ_i ( Φ(x_i, y_i) − Σ_y α_{i,y} Φ(x_i, y) ) ‖². [sent-226, score-0.093]
77 It follows that ∂F(ᾱ^t)/∂α_{i,y} = C L_{i,y} + C ⟨Φ(x_i, y), w_t⟩ = C Σ_{r ∈ R(x_i, y)} ( l_{i,r} + ⟨φ_{i,r}, w_t⟩ ), where as before w_t = C ( Σ_i Φ(x_i, y_i) − Σ_{i,y} α^t_{i,y} Φ(x_i, y) ). [sent-227, score-0.723]
78 The rest of the proof proceeds by induction; due to space constraints we give a sketch of the proof here. [sent-228, score-0.114]
79 This follows immediately from the definitions of the mappings ᾱ(θ̄^t) → ᾱ(θ̄^{t+1}) and ᾱ^t → ᾱ^{t+1} in the two algorithms, together with the identities s^t_{i,y} = −∂F(ᾱ^t)/∂α_{i,y} = −C Σ_{r ∈ R(x_i, y)} ( l_{i,r} + ⟨φ_{i,r}, w_t⟩ ) and θ^{t+1}_{i,r} − θ^t_{i,r} = ηC ( l_{i,r} + ⟨φ_{i,r}, w_t⟩ ). [sent-230, score-0.42]
80 4.1 Convergence of the exponentiated gradient QP algorithm. The following theorem shows how the optimization algorithm converges to an optimal solution. [sent-232, score-0.562]
81 The theorem compares the value of the objective function for the algorithm’s vector αt to the value for a comparison vector u ∈ ∆. [sent-233, score-0.173]
82 The convergence result is in terms of several properties of the algorithm and the comparison vector u. [sent-235, score-0.16]
83 Two other key parameters are λ, the largest eigenvalue of the positive semidefinite symmetric matrix A, and B = max_ᾱ ( max_i (∇Q(ᾱ))_i − min_i (∇Q(ᾱ))_i ) ≤ 2 ( n max_{i,j} |A_{ij}| + max_i |b_i| ). [sent-245, score-0.272]
84 Define ∇_(i) Q(ᾱ) as the segment of the gradient vector corresponding to the component ᾱ_i of ᾱ, and define the random variable X_{i,t}, satisfying Pr[ X_{i,t} = −( ∇_(i) Q(ᾱ^t) )_j ] = α^t_{i,j}. [sent-253, score-0.153]
85 The second part of the proof of the theorem involves bounding this variance in terms of the loss. [sent-260, score-0.144]
86 The following lemma relies on the fact that this variance is, to first order, the decrease in the quadratic loss, and that the second order term in the Taylor series expansion of the loss is small compared to the variance, provided the steps are not too large. [sent-261, score-0.321]
87 We shall work in the exponential parameter space: let θ̄^t be the exponential parameters at step t, so that the updates are θ̄^{t+1} = θ̄^t − η ∇Q(ᾱ^t), and the QP variables satisfy ᾱ_i = σ(θ̄_i). [sent-264, score-0.372]
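Here σ is naturally read as the per-block softmax map, matching the Gibbs parameterization used earlier (an interpretation, not a quoted definition):

    \bigl(\sigma(\bar\theta_i)\bigr)_{j} \;=\; \frac{\exp(\theta_{i,j})}{\sum_{k}\exp(\theta_{i,k})} .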
88 5 Experiments. We compared an online version of the Exponentiated Gradient algorithm with the factored Sequential Minimal Optimization (SMO) algorithm in [12] on a sequence segmentation task. [sent-274, score-0.15]
89 Each word is labelled by 9 possible tags (beginning of one of the four entity types, continuation of one of the types, or not-an-entity). [sent-277, score-0.206]
90 We trained a first-order Markov chain over the tags (in the online algorithm we calculate marginal terms, and updates to the w_t parameters, one training example at a time), [sent-278, score-0.361]
91 where our cliques are just the nodes for the tag of each word and edges between tags of consecutive words. [sent-286, score-0.245]
92 The feature vector for each node assignment consists of the word itself, its capitalization and morphological features, etc. [sent-287, score-0.264]
93 Likewise, the feature vector for each edge assignment consists of the two words and their features as well as surrounding words. [sent-289, score-0.243]
94 Figure 4 shows the growth of the dual objective function after each pass through the data for SMO and EG, for several settings of the learning rate η and the initial setting of the θ parameters. [sent-290, score-0.224]
95 These preliminary results suggest that a hybrid algorithm could get the benefits of both, by starting out with several SMO updates and then switching to EG. [sent-292, score-0.227]
96 The key issue is to switch from the marginal µ representation SMO maintains to the Gibbs θ representation that EG uses. [sent-293, score-0.13]
97 This can be done by computing conditional probabilities (dividing edge marginals by node marginals in this case) and then letting the θ's be the logs of the conditional probabilities. [sent-296, score-0.162]
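One hypothetical way to realize that conversion for a chain, assuming strictly positive marginals: give the first node part the log of its marginal and each edge part the log of the corresponding conditional (edge marginal divided by the preceding node marginal), so that summing θ over the parts of a labelling recovers its log-probability. This is a sketch of the suggestion above, not code from the paper.

    import math

    def marginals_to_theta(node_marg, edge_marg, tags):
        """Convert chain marginals (as maintained by SMO) to Gibbs parameters theta.
        node_marg[t][y]        : P(Y_t = y), assumed strictly positive
        edge_marg[t][(y1, y2)] : P(Y_{t-1} = y1, Y_t = y2), for t >= 1
        Summing theta over the parts of a labelling gives its log-probability."""
        T = len(node_marg)
        theta_node = [{y: 0.0 for y in tags} for _ in range(T)]
        theta_edge = [None] + [dict() for _ in range(1, T)]
        for y in tags:
            theta_node[0][y] = math.log(node_marg[0][y])   # first node: log marginal
        for t in range(1, T):
            for (y1, y2), p in edge_marg[t].items():
                # log conditional P(Y_t = y2 | Y_{t-1} = y1)
                theta_edge[t][(y1, y2)] = math.log(p / node_marg[t - 1][y1])
        return theta_node, theta_edge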
98 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-315, score-0.117]
99 Conditional random fields: Probabilistic models for segmenting and labeling sequence data. [sent-339, score-0.101]
100 Support vector machine learning for interdependent and structured output spaces. [sent-362, score-0.309]
wordName wordTfidf (topN-words)
[('eg', 0.303), ('smo', 0.273), ('structured', 0.246), ('xi', 0.24), ('exponentiated', 0.229), ('dual', 0.224), ('wt', 0.21), ('updates', 0.175), ('primal', 0.152), ('mrfs', 0.138), ('wcfgs', 0.138), ('clique', 0.137), ('var', 0.137), ('loss', 0.111), ('parts', 0.107), ('qps', 0.103), ('gibbs', 0.097), ('yi', 0.093), ('labels', 0.093), ('gradient', 0.09), ('zt', 0.089), ('variables', 0.087), ('kivinen', 0.082), ('mrf', 0.082), ('marginals', 0.081), ('markov', 0.076), ('quadratic', 0.075), ('assignment', 0.072), ('tags', 0.072), ('string', 0.071), ('elds', 0.071), ('eta', 0.069), ('tsochantaridis', 0.069), ('wcfg', 0.069), ('lemma', 0.069), ('cliques', 0.069), ('word', 0.068), ('max', 0.068), ('relies', 0.066), ('entity', 0.066), ('vector', 0.063), ('multiplicative', 0.062), ('feature', 0.061), ('programs', 0.061), ('countable', 0.06), ('constituent', 0.06), ('grammar', 0.06), ('grammars', 0.06), ('spans', 0.059), ('proof', 0.057), ('exp', 0.055), ('exponential', 0.055), ('labeling', 0.055), ('multiclass', 0.054), ('rd', 0.053), ('algorithm', 0.052), ('arg', 0.051), ('calculate', 0.051), ('label', 0.051), ('mcallester', 0.051), ('production', 0.051), ('expectations', 0.051), ('ij', 0.05), ('gurations', 0.047), ('optimization', 0.047), ('theorem', 0.047), ('words', 0.047), ('vertex', 0.046), ('sequence', 0.046), ('marginal', 0.046), ('derivations', 0.046), ('parse', 0.046), ('crfs', 0.046), ('convergence', 0.045), ('converges', 0.045), ('taskar', 0.044), ('parsing', 0.044), ('qp', 0.044), ('internal', 0.043), ('ui', 0.043), ('concern', 0.042), ('representation', 0.042), ('margin', 0.042), ('nitions', 0.042), ('vertices', 0.042), ('st', 0.041), ('classi', 0.04), ('part', 0.04), ('vi', 0.038), ('training', 0.037), ('calculated', 0.036), ('consecutive', 0.036), ('cristianini', 0.036), ('derivation', 0.036), ('ne', 0.035), ('rm', 0.035), ('ik', 0.035), ('assume', 0.034), ('slack', 0.034), ('bartlett', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester
Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1
2 0.26750091 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
3 0.2466252 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
4 0.17057188 44 nips-2004-Conditional Random Fields for Object Recognition
Author: Ariadna Quattoni, Michael Collins, Trevor Darrell
Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.
5 0.13682504 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
6 0.12625133 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
7 0.12572968 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
8 0.12525909 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
9 0.12499857 92 nips-2004-Kernel Methods for Implicit Surface Modeling
10 0.11931521 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers
11 0.11722038 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
12 0.11281808 113 nips-2004-Maximum-Margin Matrix Factorization
13 0.10390649 136 nips-2004-On Semi-Supervised Classification
14 0.099513143 54 nips-2004-Distributed Information Regularization on Graphs
15 0.098311342 100 nips-2004-Learning Preferences for Multiclass Problems
16 0.098256677 19 nips-2004-An Application of Boosting to Graph Classification
17 0.097777247 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
18 0.096253321 115 nips-2004-Maximum Margin Clustering
19 0.09303233 42 nips-2004-Computing regularization paths for learning multiple kernels
20 0.090040118 70 nips-2004-Following Curved Regularized Optimization Solution Paths
topicId topicWeight
[(0, -0.31), (1, 0.102), (2, 0.077), (3, 0.16), (4, 0.119), (5, -0.061), (6, 0.047), (7, 0.058), (8, -0.106), (9, 0.071), (10, 0.159), (11, 0.034), (12, -0.154), (13, 0.144), (14, 0.095), (15, 0.125), (16, -0.045), (17, 0.137), (18, -0.074), (19, 0.052), (20, 0.022), (21, 0.054), (22, -0.053), (23, -0.023), (24, -0.098), (25, -0.008), (26, 0.039), (27, -0.038), (28, -0.142), (29, -0.076), (30, -0.078), (31, -0.017), (32, 0.008), (33, 0.079), (34, -0.027), (35, -0.069), (36, 0.02), (37, -0.064), (38, 0.027), (39, 0.039), (40, -0.045), (41, 0.098), (42, -0.015), (43, -0.02), (44, 0.162), (45, -0.053), (46, 0.053), (47, -0.006), (48, -0.065), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.95907587 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester
Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1
2 0.72106951 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
3 0.68263853 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
4 0.64542937 44 nips-2004-Conditional Random Fields for Object Recognition
Author: Ariadna Quattoni, Michael Collins, Trevor Darrell
Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.
5 0.64151436 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
Author: Chiranjib Bhattacharyya, Pannagadatta K. Shivaswamy, Alex J. Smola
Abstract: We propose a convex optimization based strategy to deal with uncertainty in the observations of a classification problem. We assume that instead of a sample (xi , yi ) a distribution over (xi , yi ) is specified. In particular, we derive a robust formulation when the distribution is given by a normal distribution. It leads to Second Order Cone Programming formulation. Our method is applied to the problem of missing data, where it outperforms direct imputation. 1
6 0.59901655 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
7 0.57028729 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
8 0.53032452 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
9 0.52717954 19 nips-2004-An Application of Boosting to Graph Classification
10 0.51110643 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
11 0.50127041 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
12 0.47832912 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
13 0.47691265 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
14 0.46096694 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
15 0.44323969 92 nips-2004-Kernel Methods for Implicit Surface Modeling
16 0.44210446 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
17 0.44189805 17 nips-2004-Adaptive Manifold Learning
18 0.4344959 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
19 0.43250561 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
20 0.41406435 100 nips-2004-Learning Preferences for Multiclass Problems
topicId topicWeight
[(13, 0.134), (15, 0.139), (26, 0.075), (31, 0.024), (32, 0.255), (33, 0.184), (35, 0.01), (39, 0.04), (50, 0.035), (53, 0.014), (87, 0.012)]
simIndex simValue paperId paperTitle
1 0.87231344 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth
Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1
2 0.86619931 23 nips-2004-Analysis of a greedy active learning strategy
Author: Sanjoy Dasgupta
Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1
same-paper 3 0.84418601 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester
Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1
4 0.75569642 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
5 0.74364138 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
6 0.73540539 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
7 0.73110914 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.72527146 163 nips-2004-Semi-parametric Exponential Family PCA
9 0.7252686 70 nips-2004-Following Curved Regularized Optimization Solution Paths
10 0.72433037 102 nips-2004-Learning first-order Markov models for control
11 0.72205973 116 nips-2004-Message Errors in Belief Propagation
12 0.72199309 178 nips-2004-Support Vector Classification with Input Data Uncertainty
13 0.7208209 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
14 0.72001964 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
15 0.71970773 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
16 0.71947533 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
17 0.71903151 69 nips-2004-Fast Rates to Bayes for Kernel Machines
18 0.71855301 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
19 0.71850222 124 nips-2004-Multiple Alignment of Continuous Time Series
20 0.71674567 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction