jmlr jmlr2008 jmlr2008-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, University of Bristol, Bristol BS8 1TR, UK. Editor: Michael Collins. Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. [sent-11, score-0.459]
2 Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound 1. [sent-15, score-0.207]
3 In fact in many cases the structured output prediction approach matches practice more closely. [sent-19, score-0.216]
4 In contrast, in structured output prediction the output space is typically massive, containing a rich structure relating the different output values. [sent-22, score-0.376]
5 1 Graphical and Grammatical Models for Structured Data An immediate approach for structured output prediction would be to use a probabilistic model jointly over the input and the output variables. [sent-27, score-0.296]
6 Such methods, known as discriminative learning algorithms (DLAs), make predictions by optimizing a scoring function over the output space, where this scoring function does not necessarily have a probabilistic interpretation. [sent-34, score-0.222]
7 , 2001), re-ranking with perceptron (Collins, 2002b), hidden Markov perceptron (HMP) (Collins, 2002a), sequence labeling with boosting (Altun et al. [sent-37, score-0.31]
8 In particular, they all make use of a scoring function that is linear in a set of parameters to score each element of the output space. [sent-45, score-0.236]
9 , 2005) prescribe seeking hypotheses that make the score of the correct outputs in the training set larger than the scores of all incorrect ones (by a certain margin). [sent-57, score-0.212]
10 We provide specific examples of how these moments can be computed for three types of structured output prediction problems: the sequence alignment problem, sequence labeling, and learning to parse with a context free grammar for RNA secondary structure prediction. [sent-63, score-0.734]
11 The first approach is the maximization of the Z-score, a common statistical measure of surprise, which is large if the scores of the correct outputs in the training set are significantly different from the scores of all incorrect outputs in the output space. [sent-65, score-0.334]
12 3 Outline of this Paper The rest of the paper is structured as follows: Section 2 formally introduces the problem of structured output learning and the hypothesis space considered. [sent-72, score-0.291]
13 For example, in sequence alignment learning the output variables parameterize the alignment between two sequences; in sequence labeling, y is the label sequence associated with the observed sequence x; and in learning to parse, y represents a parse tree corresponding to a given sequence x. [sent-86, score-1.13]
14 hθ(x) = arg max_{y∈Y} sθ(x, y). (2) This type of prediction function has been used in previous approaches for structured output prediction. [sent-95, score-0.216]
15 Consider a loss function Lθ that maps the input x and the true training output ȳ to a positive real number Lθ(x, ȳ), in some way measuring the discrepancy between the prediction hθ(x) and ȳ. [sent-114, score-0.206]
16 Therefore, where perfect prediction of the data cannot be achieved using the hypothesis space considered, it could be more useful to measure the fraction of outputs for which the score is ranked higher than for the correct output. [sent-136, score-0.22]
17 1 Sequence Labeling Learning. In sequence labeling tasks a sequence is taken as an input, and the output to be predicted is a sequence that annotates the input sequence, that is, with a symbol corresponding to each symbol in the input sequence. [sent-145, score-0.478]
18 Each observed symbol x_i is an element of the observed symbol alphabet Σx, and the hidden symbols y_i are elements of Σy, with n_o = |Σx| and n_h = |Σy| the respective alphabet sizes. [sent-158, score-0.523]
19 Furthermore, it is assumed that the probability distribution of the observed symbol xk depends solely on the value of yk (quantified by the emission probability distribution P(xk |yk )). [sent-162, score-0.233]
20 Thus, to fully specify the HMM, one needs to consider all the transition probabilities (denoted t_ij for i, j ∈ Σy, for the transition from symbol i to j) and the emission probabilities (denoted e_io for the emission of symbol o ∈ Σx by symbol i ∈ Σy). [sent-167, score-0.521]
21 In fact, in general the vector φ(x, y) contains not only statistics associated with the transition and emission probabilities but also any feature that reflects the properties of the objects represented by the nodes of the HMM. [sent-181, score-0.213]
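A minimal sketch (illustrative, not the authors' code) of the HMM feature map just described: φ(x, y) stacks emission counts and transition counts, and the score is the inner product with θ. The toy alphabet sizes mirror the n_h = 2, n_o = 4 HMM mentioned later in the experiments; everything else here is an assumption for illustration.

```python
# Sketch: HMM feature map phi(x, y) as emission and transition counts,
# and the linear score s_theta(x, y) = <theta, phi(x, y)>.
import numpy as np

def hmm_feature_vector(x, y, n_h, n_o):
    """x, y: integer sequences of equal length; states in 0..n_h-1, symbols in 0..n_o-1."""
    phi = np.zeros(n_h * n_o + n_h * n_h)
    for state, symbol in zip(y, x):            # emission counts: state emits symbol
        phi[state * n_o + symbol] += 1
    for prev, nxt in zip(y[:-1], y[1:]):       # transition counts: prev -> nxt
        phi[n_h * n_o + prev * n_h + nxt] += 1
    return phi

def score(theta, x, y, n_h, n_o):
    return float(theta @ hmm_feature_vector(x, y, n_h, n_o))

# Toy usage with n_h = 2 hidden states and n_o = 4 observed symbols.
rng = np.random.default_rng(0)
theta = rng.normal(size=2 * 4 + 2 * 2)
x, y = [0, 3, 1, 2], [0, 1, 1, 0]
print(score(theta, x, y, n_h=2, n_o=4))
```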
22 2 Sequence Alignment Learning. As a second case study, we consider the problem of learning how to align sequences: given as training examples a set of correct pairwise global alignments, find the parameter values that ensure sequences are optimally aligned. [sent-192, score-0.217]
23 This task is also known as the inverse parametric sequence alignment problem (IPSAP), and since its introduction in Gusfield et al. [sent-193, score-0.233]
24 The strings are ordered sequences of symbols si ∈ S , with S a finite alphabet of size nS . [sent-198, score-0.295]
25 In biological applications, for DNA sequences the alphabet contains the symbols associated with the nucleotides (S = {A, C, G, T}), while for amino acid sequences the alphabet is S = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}. [sent-199, score-0.498]
26 An alignment of two strings S1 and S2 of lengths n1 and n2 is defined as a pair of sequences T1 and T2 of equal length n ≥ n1 , n2 that are obtained by taking S1 and S2 respectively and inserting symbols − at various locations in order to arrive at strings of length n. [sent-200, score-0.429]
27 In analogy with the notation in this paper, the pair of given sequences S1 and S2 represents the input variable x, while their alignment is the output y. [sent-207, score-0.385]
28 Figure 2 depicts a pairwise alignment between two sequences and the associated path in the alignment graph. [sent-210, score-0.508]
29 However, an efficient DP algorithm for computing the alignment ȳ with maximal score is known in the literature: the Needleman-Wunsch algorithm (Needleman, 1970). [sent-212, score-0.256]
30 Then the score of an alignment is: sθ(x, y) = θ_m m + θ_s s + θ_o o + θ_e e. (Figure 2: An alignment y between two sequences S1 and S2 can be represented by a path in the alignment graph.) [sent-215, score-0.732]
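A minimal sketch of the Needleman-Wunsch recursion referenced above, written for the simpler three-parameter model (one match score, one mismatch score, one linear gap penalty) that this section considers alongside the affine model; the affine open/extend variant needs extra Gotoh-style bookkeeping not shown here. The parameter values are illustrative, not from the paper.

```python
# Needleman-Wunsch sketch for the three-parameter model (match, mismatch, gap);
# returns the maximal alignment score. Illustrative, not the paper's code.
def needleman_wunsch(s1, s2, theta_m=1.0, theta_s=-1.0, theta_g=-2.0):
    n1, n2 = len(s1), len(s2)
    # D[i][j] = best score of aligning s1[:i] with s2[:j]
    D = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        D[i][0] = i * theta_g
    for j in range(1, n2 + 1):
        D[0][j] = j * theta_g
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            diag = theta_m if s1[i - 1] == s2[j - 1] else theta_s
            D[i][j] = max(D[i - 1][j - 1] + diag,   # match / mismatch
                          D[i - 1][j] + theta_g,    # gap in s2
                          D[i][j - 1] + theta_g)    # gap in s1
    return D[n1][n2]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```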
31 In general there are d = n_S(n_S + 1)/2 different parameters in θ associated with the symbols of the alphabet, plus two additional ones corresponding to the gap penalties. [sent-220, score-0.209]
32 This means that to align sequences of amino acids we have 210 parameters to determine, plus 2 further parameters for gap opening and gap extension. [sent-221, score-0.283]
33 Again the score is a linear function of the parameters: sθ(x, y) = ∑_{j≥k} θ_jk z_jk + θ_o o + θ_e e, and the optimal alignment is computed by the Needleman-Wunsch algorithm. [sent-223, score-0.256]
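A sketch of the feature extraction behind this score: given a gap-padded alignment, count the aligned (unordered) symbol pairs z_jk, the gap openings o and the gap extensions e, then take the inner product with θ. The gap-opening/extension convention used here is an assumption for illustration, not taken from the paper.

```python
# Sketch (assumed conventions): features of a given pairwise alignment for the
# substitution-matrix model, and the corresponding linear score.
from collections import Counter

def alignment_features(t1, t2, gap="-"):
    """t1, t2: gap-padded strings of equal length (an alignment)."""
    assert len(t1) == len(t2)
    z = Counter()          # counts of unordered aligned symbol pairs
    o = e = 0              # gap openings / extensions (assumed convention)
    in_gap = False
    for a, b in zip(t1, t2):
        if a == gap or b == gap:
            if in_gap:
                e += 1     # continuing a gap run: extension
            else:
                o += 1     # starting a gap run: opening
            in_gap = True
        else:
            z[tuple(sorted((a, b)))] += 1
            in_gap = False
    return z, o, e

def alignment_score(theta_sub, theta_o, theta_e, t1, t2):
    z, o, e = alignment_features(t1, t2)
    return sum(theta_sub.get(pair, 0.0) * c for pair, c in z.items()) + theta_o * o + theta_e * e

# Toy usage with an illustrative (partial) substitution matrix.
theta_sub = {("A", "A"): 2.0, ("G", "G"): 2.0, ("T", "T"): 2.0, ("A", "G"): -1.0, ("A", "T"): -1.0}
print(alignment_score(theta_sub, theta_o=-3.0, theta_e=-1.0, t1="GA-TA", t2="GATTA"))
```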
34 3 Learning to Parse. In learning to parse, the input x is given by a sequence, and the output is given by its associated parse tree according to a context-free grammar. [sent-226, score-0.372]
35 Learning to parse has already been studied as a particular instantiation of structured output learning, both in natural language processing applications (Tsochantaridis et al. [sent-228, score-0.295]
36 , 2004) and in computational biology for RNA secondary structure alignment (Sato and Sakakibara, 2005) and prediction (Do et al. [sent-230, score-0.269]
37 Given a sequence x and an associated parse tree y we can define a feature vector φ(x, y) which contains a count of the number of occurrences of each of the rules in the parse tree y. [sent-250, score-0.394]
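A small sketch of this feature map for parsing: φ(x, y) is obtained by counting how often each grammar rule occurs in the parse tree y. The nested-tuple tree encoding and the toy rules are illustrative choices, not the paper's.

```python
# Sketch: phi(x, y) for learning to parse, as counts of rule occurrences in a parse tree.
# Trees are encoded as nested tuples (NONTERMINAL, child1, child2, ...) with string leaves.
from collections import Counter

def rule_counts(tree):
    counts = Counter()
    def visit(node):
        if isinstance(node, tuple):
            head, children = node[0], node[1:]
            body = tuple(c[0] if isinstance(c, tuple) else c for c in children)
            counts[(head, body)] += 1            # one occurrence of rule head -> body
            for c in children:
                visit(c)
    visit(tree)
    return counts

# Toy parse of "a a b" under an illustrative grammar.
tree = ("S", ("A", "a"), ("S", ("A", "a"), ("B", "b")))
print(rule_counts(tree))
# e.g. Counter({('A', ('a',)): 2, ('S', ('A', 'S')): 1, ('S', ('A', 'B')): 1, ('B', ('b',)): 1})
```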
38 Computing the Moments of the Scoring Function. An interesting corollary of the proposed structured output approach based on linear scoring functions is that certain statistics of the score s(x, y) can be expressed as a function of the parameter vector θ. [sent-254, score-0.321]
39 The matrix C is a matrix with elements: c_pq = (1/N) ∑_{j=1}^{N} (φ_p(x, y_j) − µ_p)(φ_q(x, y_j) − µ_q) = (1/N) ∑_{j=1}^{N} φ_p(x, y_j) φ_q(x, y_j) − µ_p µ_q = v_pq − µ_p µ_q, (4) where 1 ≤ p, q ≤ d. [sent-261, score-0.532]
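Equation (4) in matrix form, as a tiny numpy sketch: stack the N feature vectors row-wise and take one mean and one centered second moment. This brute-force form is only viable when the output space is small enough to enumerate, which is exactly the limitation the next subsection addresses.

```python
# Sketch of equation (4): with the N feature vectors phi(x, y_j) stacked row-wise
# in Phi (shape N x d), mu and C over the output space are one mean and one
# centered second moment.
import numpy as np

Phi = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 1.0],
                [1.0, 1.0, 0.0],
                [2.0, 0.0, 1.0]])      # toy feature vectors for N = 4 outputs, d = 3
mu = Phi.mean(axis=0)                  # mu_p = (1/N) sum_j phi_p(x, y_j)
V = Phi.T @ Phi / Phi.shape[0]         # v_pq = (1/N) sum_j phi_p phi_q
C = V - np.outer(mu, mu)               # c_pq = v_pq - mu_p mu_q
print(mu, C, sep="\n")
```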
40 1 Magic Moments It should be clear that in practical structured output learning problems the number N of possible output vectors associated with a given input x can be massive. [sent-263, score-0.277]
41 2 Sequence Labeling Learning Given a fixed input sequence x, we show here for the sequence labeling example that the elements of µ and C can be computed exactly and efficiently by dynamic programming routines. [sent-272, score-0.23]
42 We first consider the vector µ and construct it in a way that the first n_h n_o elements contain the mean values associated with the emission probabilities and the remaining n_h^2 elements correspond to transition probabilities. [sent-273, score-0.213]
43 In the emission part, for each element an n_h × m dynamic programming table µ^e_pq is considered. [sent-275, score-0.263]
44 The index p denotes the hidden state (1 ≤ p ≤ n_h) and q refers to the observation (1 ≤ q ≤ n_o). [sent-276, score-0.454]
45 Basically, at step (i, j) the mean value µ^e_pq(i, j) is given by summing the occurrences of the emission probability e_pq at the previous steps (e.g. [sent-281, score-0.235]
46 ∑_i µ^e_pq(i, j − 1) π(i, j − 1)) with the number of paths in the previous steps (if the current observation x_j is q and the current state y_j is p), and dividing this quantity by π(i, j). [sent-283, score-0.47]
47 This computation is again performed following Algorithm 1, but with the recursive relations given in Algorithm 4 in Appendix E (the numbers 5, 11, 12 in Algorithm 4 indicate the lines of Algorithm 1 where the formulas must be inserted). [sent-289, score-0.533]
48 Algorithm 1: Computation of µ^e_pq for sequence labeling learning. 1: Input: x = (x1, x2, . [sent-290, score-0.403]
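To make concrete what Algorithms 1 and 4 compute, here is a brute-force sketch (illustrative, not the paper's code) that enumerates all n_h^m label sequences for a tiny fixed input, stacks the HMM feature vectors, and averages; the DP routines return the same µ and C without the exponential enumeration.

```python
# Brute-force check of the "magic moments" for sequence labeling: enumerate every
# label sequence y for a fixed tiny input x and apply equation (4) directly.
import itertools
import numpy as np

def phi(x, y, n_h, n_o):
    f = np.zeros(n_h * n_o + n_h * n_h)
    for p, q in zip(y, x):
        f[p * n_o + q] += 1                       # emission counts
    for i, j in zip(y[:-1], y[1:]):
        f[n_h * n_o + i * n_h + j] += 1           # transition counts
    return f

n_h, n_o = 2, 4
x = [0, 3, 1, 2, 3]                               # tiny observed sequence, m = 5
Phi = np.array([phi(x, y, n_h, n_o)
                for y in itertools.product(range(n_h), repeat=len(x))])
mu = Phi.mean(axis=0)
C = Phi.T @ Phi / len(Phi) - np.outer(mu, mu)
print(len(Phi), mu.shape, C.shape)                # 32 label sequences, d = 12
```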
49 Moreover, in most applications the size of the observation alphabet (for example the size of the dictionary in a natural language processing system) is very large, while the sequences to be labeled are short. [sent-308, score-0.208]
50 We point out that the proposed algorithm can be easily extended to the case of arbitrary features in the vector φ(x, y) (not only those associated with transition and emission probabilities). [sent-311, score-0.213]
51 To compute µ and C in these situations, the derivation of appropriate formulas similar to those for µ^e_pq, c^e_pq and the cross terms c^{et} is straightforward. [sent-312, score-0.567]
52 3 Sequence Alignment Learning For the sequence alignment learning task we consider separately the three-parameter model, the model with affine gap penalties and the model with substitution matrices. [sent-321, score-0.323]
53 In fact each alignment corresponds to a path in the alignment graph associated with the DP matrix. [sent-329, score-0.374]
54 The covariance matrix C is the 3×3 matrix with elements c_pq, p, q ∈ {m, s, g}, and it is symmetric (c_sg = c_gs, c_mg = c_gm, c_sm = c_ms). [sent-334, score-0.297]
55 Each value c_pq can be obtained by considering (4) and computing the associated values v_pq with appropriate recursive relations (see Algorithm 2). [sent-335, score-0.502]
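Analogously, a brute-force sketch for the three-parameter alignment model: enumerate every path in the alignment graph of two very short strings, record the features (matches m, mismatches s, gaps g), and form µ and the 3×3 matrix C from (4). The DP recursions referenced above (Algorithm 2) compute the same quantities without enumeration; this sketch is only illustrative.

```python
# Brute-force moments for the three-parameter alignment model: enumerate all paths
# in the alignment graph of two tiny strings and apply equation (4).
import numpy as np

def enumerate_alignments(s1, s2):
    feats = []
    def walk(i, j, m, s, g):
        if i == len(s1) and j == len(s2):
            feats.append((m, s, g))
            return
        if i < len(s1) and j < len(s2):            # diagonal step: match or mismatch
            if s1[i] == s2[j]:
                walk(i + 1, j + 1, m + 1, s, g)
            else:
                walk(i + 1, j + 1, m, s + 1, g)
        if i < len(s1):                            # gap in s2
            walk(i + 1, j, m, s, g + 1)
        if j < len(s2):                            # gap in s1
            walk(i, j + 1, m, s, g + 1)
    walk(0, 0, 0, 0, 0)
    return np.array(feats, dtype=float)

Phi = enumerate_alignments("GAT", "GT")
mu = Phi.mean(axis=0)
C = Phi.T @ Phi / len(Phi) - np.outer(mu, mu)
print(len(Phi), mu, C, sep="\n")
```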
56 2 Affine Gap Penalties. As before we can define the vector µ = [µ_m µ_s µ_o µ_e]^T and the covariance matrix C as the 4 × 4 symmetric matrix with elements c_pq with p, q ∈ {m, s, o, e}. [sent-338, score-0.297]
57 In particular, µ_m, µ_s, v_mm, v_ms and v_ss are calculated as above, while the other values are obtained with the formulas in Algorithm 5 in Appendix E. [sent-340, score-0.223]
58 For the others it is: µ^z_pq(i, j) := [ µ^z_pq(i − 1, j) π(i − 1, j) + µ^z_pq(i, j − 1) π(i, j − 1) + (µ^z_pq(i − 1, j − 1) + M) π(i − 1, j − 1) ] / π(i, j), where M = 1 when the two corresponding symbols in the alignment are equal to p and q or vice versa, with p, q ∈ S. [sent-350, score-1.161]
59 The values v_eo, v_ee and v_oo are calculated as above. [sent-352, score-0.237]
60 The derivation of the formulas for v_{z_pq z_p'q'} is straightforward from v_ms, considering the appropriate values for M and the mean values. [sent-353, score-0.346]
61 4 Learning to Parse For a given input string x, let µ_p and c_pq be the mean of occurrences of rule p and the covariance between the numbers of occurrences of rules p and q, respectively, that is, the elements of µ and C. [sent-356, score-0.315]
62 We use two types of auxiliary variables, π(s, t, ϒ_i) and π(s, t, ϒ_i, α), which are the number of possible parse trees whose root is ϒ_i for the substring x_{s|t}, and the number of possible parse trees whose root is applied to rule ϒ_i → α for the substring x_{s|t}, where (ϒ_i → α) ∈ R. [sent-367, score-0.26]
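A sketch of the counting quantity π(s, t, ϒ_i): for a grammar assumed to be in Chomsky normal form, the number of parse trees of each substring rooted at each nonterminal can be accumulated with a CYK-style dynamic program. The toy grammar and the span-indexed dictionary are illustrative choices, not the paper's.

```python
# Sketch of pi(s, t, NT): count the parse trees of each substring rooted at each
# nonterminal with a CYK-style DP (grammar assumed in Chomsky normal form).
from collections import defaultdict

binary = {("S", ("A", "B")), ("S", ("A", "S")), ("B", ("S", "A"))}   # NT -> NT NT
unary = {("A", "a"), ("B", "b"), ("S", "b")}                          # NT -> terminal

def count_trees(x):
    n = len(x)
    pi = defaultdict(int)                      # pi[(s, t, NT)] = #trees for x[s:t] rooted at NT
    for s in range(n):
        for head, term in unary:
            if x[s] == term:
                pi[(s, s + 1, head)] += 1
    for span in range(2, n + 1):
        for s in range(n - span + 1):
            t = s + span
            for mid in range(s + 1, t):
                for head, (left, right) in binary:
                    pi[(s, t, head)] += pi[(s, mid, left)] * pi[(mid, t, right)]
    return pi

x = "aab"
pi = count_trees(x)
print(pi[(0, len(x), "S")])                    # number of parses of the full string rooted at S
```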
63 We count the number of co-occurrences γ_pq(s, t, ϒ_i) for each pair of rules p and q. [sent-375, score-0.275]
64 γ_pq(s, t, ϒ_i, α) denotes the number of co-occurrences in all possible parse trees whose root is ϒ_i for x_{s|t}. [sent-376, score-0.405]
65 Finally, γ_pq(1, m, S) is the number of co-occurrences of rules p and q in all parse trees given x. [sent-378, score-0.405]
66 More specifically, the objective function we propose is the difference between the score of the true output and the mean score of the distribution, divided by the square root of the variance as a normalization. [sent-393, score-0.25]
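Since the score is linear in θ, its mean and variance over the output space are θᵀµ and θᵀCθ, so the objective just described can be written as follows. This is a reconstruction from the description above, not a quotation of the paper's equation.

```latex
% Z-score objective (sketch): linear score, so mean = theta' mu, variance = theta' C theta.
\[
Z(\theta) \;=\; \frac{s_\theta(x,\bar{y}) - \mathbb{E}_{y}\!\left[s_\theta(x,y)\right]}
                      {\sqrt{\operatorname{Var}_{y}\!\left[s_\theta(x,y)\right]}}
          \;=\; \frac{\theta^\top \phi(x,\bar{y}) - \theta^\top \mu}{\sqrt{\theta^\top C\,\theta}} .
\]
```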
67 If the normality assumption is too unrealistic, one could still apply a (looser) Chebyshev tail bound to show that the number of scores that exceed a large training output score sθ(x, ȳ) is small. [sent-403, score-0.374]
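One way to make that Chebyshev-style argument concrete is the standard one-sided (Cantelli) form below; this is a sketch of the kind of bound meant, not necessarily the exact inequality used in the paper.

```latex
% One-sided Chebyshev (Cantelli) sketch: few outputs score above a high-scoring training output.
\[
\Pr_{y}\!\Big( s_\theta(x,y) \ge s_\theta(x,\bar{y}) \Big)
\;\le\; \frac{\theta^\top C\,\theta}{\theta^\top C\,\theta + \big(s_\theta(x,\bar{y}) - \theta^\top \mu\big)^{2}},
\qquad \text{whenever } s_\theta(x,\bar{y}) > \theta^\top \mu .
\]
```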
68 Their approach to structured output learning is to explicitly search for the parameter values θ such that the optimal hidden variables ȳ_i can be reconstructed from x_i for every training example i. [sent-460, score-0.237]
69 In Doolittle (1981), Z-scores are used to assess the significance of a pairwise alignment between two amino acid sequences and are computed by calculating the mean and the standard deviation over a random sample taken from a standard database or obtained by permuting the given sequence. [sent-497, score-0.305]
70 They proposed an efficient algorithm that finds the standardized score in the case of permutations of the original sequences, but this approach is limited to ungapped sequences. [sent-502, score-0.219]
71 4 SODA: Structured Output Discriminant Analysis Another way to extend problem (5) to the general situation of a training set T is to minimize the empirical risk associated with the upper bound on the relative ranking loss, L^RRU, defined in the usual way as: R^RRU_θ(T) = ∑_i L^RRU_θ(x_i, ȳ_i). [sent-505, score-0.219]
72 Experimental Results In this subsection we provide some experimental results for the three illustrative examples proposed: sequence labeling, sequence alignment and sequence parse learning. [sent-530, score-0.487]
73 1 Sequence Labeling Learning The first series of experiments, developed in the context of sequence labeling learning, analyzes the behavior of the Z-score based algorithm and of the SODA using both artificial data and sequences of text for named entity recognition. [sent-532, score-0.336]
74 We consider two different HMMs, one with n_h = 2, n_o = 4 and one with n_h = 3, n_o = 5, with assigned transition and emission probabilities. [sent-536, score-0.328]
75 For these models, we generate hidden and observed sequences of length 100. [sent-537, score-0.206]
76 The regularization parameters associated with each method are determined based on the performance on a separate validation set of 100 sequences generated together with the training and the test sets. [sent-546, score-0.206]
77 In fact, in situations where the sizes of the hidden and the observed spaces are large and long sequences must be considered, the computation of b* and C* with DP can be quite time-consuming. [sent-604, score-0.206]
78 The hidden alphabet is limited to nh = 9 different labels, since the expression types are only persons, organizations, locations and miscellaneous names. [sent-625, score-0.293]
79 [Figure 7: Average number of correctly reconstructed hidden sequences (%) versus the number of constraints (0–100), comparing Z-score (constr), MM and HMP, for an HMM with n_h = 2 and n_o = 4.] [sent-697, score-0.443]
80 A pair of observed and hidden sequences of length m = 100 is considered. [sent-713, score-0.206]
81 The task is to estimate the values of transition and emission probabilities such that the observed sequences are generated by the hidden one. [sent-714, score-0.387]
82 In Figure 7 the histograms obtained by binning the number of constraints needed to reconstruct the original transition and emission probabilities are shown for an HMM with n_h = 2 and n_o = 4. [sent-717, score-0.359]
83 5 Sequence Alignment Learning The second series of experiments has been performed in the context of sequence alignment learning. [sent-722, score-0.233]
84 For the 211 parameter model we also compare SODA with a generative sequence alignment model, where substitution rates between amino acids are computed using Laplace estimates. [sent-802, score-0.311]
85 Here we only consider training sets of size up to 500 pairs of sequences since the advantage in terms of test error for SODA (and in general for all discriminative approaches) with respect to generative approaches is more evident for training sets of small sizes. [sent-815, score-0.255]
86 Weights of the grammar are optimized with a training set, and structures associated with sequences in the test set are predicted by the Viterbi algorithm. [sent-827, score-0.206]
87 Here it is good to keep in mind that this loss itself is an upper bound for the relative ranking loss, such that the Rademacher bound is also a bound on the expectation of the relative ranking loss. [sent-890, score-0.227]
88 Theorem 8 (On the PAC-learnability of structured output prediction) Given a hypothesis space H of prediction functions hθ as defined above. [sent-919, score-0.257]
89 In this way, we have derived two new efficient algorithms for structured output prediction that rely on these statistics, both of which can be trained by solving one linear system of equations. [sent-940, score-0.216]
90 All the elements associated with transition probabilities assume the same values, while for the emission probabilities µ^e_pq = µ^e_pf, ∀ q ≠ f. [sent-951, score-0.448]
91 It is a symmetric block matrix basically made of three components: the block associated with the emission probabilities, that of the transition probabilities, and that relative to the mixed terms. [sent-953, score-0.244]
92 In the emission part there are 2n_o possible different values, since c^e_{pq,pq} = c^e_{pf,pf} ∀ q ≠ f, c^e_{pq,p'q'} = 0 ∀ q ≠ q', and c^e_{pq,p'q} = c^e_{pf,p'f} ∀ q ≠ f. [sent-955, score-0.593]
93 In fact there are n_o values c^{et}_{pq,p'z'} with p = p' = z', n_o values with p = p', p ≠ z', n_o values with p = z', p ≠ p', and n_o values with p ≠ p', p ≠ z'. [sent-960, score-0.527]
94 This approach is suited to sequence labeling problems and HMM features, and it is particularly effective for problems where n_h, the size of the hidden state alphabet, is small and n_o, the size of the observation alphabet, is large. [sent-970, score-0.24]
95 Here E denotes the block associated with the emission probabilities, T that corresponding to the transition probabilities, and M that relative to the mixed terms. [sent-979, score-0.213]
96 (e.g., sequence labeling problems for text analysis such as NER or POS), the emission part represents the main bottleneck in the computation of the inverse, since its size depends on n_o (e. [sent-983, score-0.284]
97 In particular, the matrix obtained by E^{-1} M P^{-1} M^T E^{-1} is a block matrix made of n_h × n_h equal blocks. [sent-1047, score-0.356]
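A generic sketch of the identity behind the E^{-1} M P^{-1} M^T E^{-1} term above: inverting a symmetric 2×2 block matrix [[E, M], [M^T, T]] via the Schur complement, with P standing for T − M^T E^{-1} M (an assumption about the notation). This is plain linear algebra; the paper's specialized routine, which additionally exploits the structure of the emission block E, is not reproduced here.

```python
# Sketch: inverse of a symmetric block matrix [[E, M], [M^T, T]] via the Schur
# complement P = T - M^T E^{-1} M; the top-left block of the inverse is
# E^{-1} + E^{-1} M P^{-1} M^T E^{-1}. Generic, not the paper's specialized routine.
import numpy as np

def block_inverse(E, M, T):
    Einv = np.linalg.inv(E)
    P = T - M.T @ Einv @ M                   # Schur complement of E
    Pinv = np.linalg.inv(P)
    top_left = Einv + Einv @ M @ Pinv @ M.T @ Einv
    top_right = -Einv @ M @ Pinv
    return np.block([[top_left, top_right], [top_right.T, Pinv]])

rng = np.random.default_rng(0)
A = rng.normal(size=(7, 7))
Cfull = A @ A.T + 7 * np.eye(7)              # random symmetric positive definite matrix
E, M, T = Cfull[:4, :4], Cfull[:4, 4:], Cfull[4:, 4:]
print(np.allclose(block_inverse(E, M, T), np.linalg.inv(Cfull)))   # True
```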
98 To make this more concrete, for the HMM prediction problem discussed earlier, this is equal to N^{n_o^m} = n_h^{m n_o^m}, which is doubly exponential in the length m of the sequences. [sent-1108, score-0.332]
99 Algorithms This section contains additional formulas that can be used for the computation of the moments, respectively in the case of sequence labeling (Algorithm 4) and of sequence alignment (Algorithm 5). [sent-1133, score-0.503]
100 Generalization bounds and consistency for structured labeling in predicting structured data. [sent-1267, score-0.276]
wordName wordTfidf (topN-words)
[('soda', 0.397), ('pq', 0.235), ('rru', 0.21), ('hmp', 0.186), ('icci', 0.178), ('ristianini', 0.178), ('hmm', 0.172), ('alignment', 0.171), ('agic', 0.169), ('oments', 0.169), ('eo', 0.157), ('rademacher', 0.152), ('nh', 0.147), ('mm', 0.145), ('utput', 0.144), ('sequences', 0.134), ('parse', 0.13), ('tructured', 0.129), ('dlas', 0.121), ('pqp', 0.121), ('rediction', 0.118), ('emission', 0.116), ('ie', 0.108), ('labeling', 0.106), ('crfs', 0.096), ('rna', 0.094), ('dp', 0.092), ('structured', 0.085), ('score', 0.085), ('vgg', 0.081), ('output', 0.08), ('alphabet', 0.074), ('hidden', 0.072), ('scoring', 0.071), ('altun', 0.067), ('transition', 0.065), ('yk', 0.064), ('sequence', 0.062), ('pac', 0.057), ('vmm', 0.056), ('vss', 0.056), ('formulas', 0.056), ('vms', 0.055), ('symbol', 0.053), ('gap', 0.053), ('tsochantaridis', 0.051), ('prediction', 0.051), ('symbols', 0.05), ('basepairs', 0.048), ('pgms', 0.048), ('vme', 0.048), ('vmg', 0.048), ('vmo', 0.048), ('vsg', 0.048), ('secondary', 0.047), ('moments', 0.046), ('incorrect', 0.044), ('align', 0.043), ('outputs', 0.043), ('taskar', 0.042), ('scores', 0.042), ('bound', 0.042), ('hypothesis', 0.041), ('cet', 0.041), ('sentences', 0.041), ('generative', 0.041), ('cooccurrences', 0.04), ('fda', 0.04), ('ricci', 0.04), ('vee', 0.04), ('voo', 0.04), ('training', 0.04), ('occurrences', 0.04), ('ni', 0.038), ('ner', 0.037), ('risk', 0.037), ('xs', 0.037), ('hmms', 0.037), ('strings', 0.037), ('bbt', 0.037), ('substitution', 0.037), ('collins', 0.036), ('loss', 0.035), ('perceptron', 0.035), ('mismatches', 0.034), ('entity', 0.034), ('hamming', 0.033), ('matrices', 0.033), ('ranking', 0.033), ('ctpzp', 0.032), ('tpz', 0.032), ('veo', 0.032), ('yij', 0.032), ('associated', 0.032), ('bt', 0.032), ('cardinality', 0.032), ('arg', 0.031), ('matrix', 0.031), ('constraints', 0.031), ('alignments', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
2 0.093511559 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
3 0.077270955 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
4 0.072345071 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
5 0.067318074 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
6 0.062491592 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.062412459 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.061446615 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
9 0.059708212 92 jmlr-2008-Universal Multi-Task Kernels
10 0.058981113 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
11 0.056319766 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
12 0.055992696 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.050156184 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
14 0.049105663 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
15 0.048554238 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
16 0.048307952 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
17 0.045102231 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
18 0.044957399 58 jmlr-2008-Max-margin Classification of Data with Absent Features
19 0.043875698 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
topicId topicWeight
[(0, 0.227), (1, -0.05), (2, -0.004), (3, -0.015), (4, -0.055), (5, 0.029), (6, 0.04), (7, 0.017), (8, -0.042), (9, -0.091), (10, -0.087), (11, 0.039), (12, -0.112), (13, 0.117), (14, 0.032), (15, -0.157), (16, -0.131), (17, 0.029), (18, -0.141), (19, -0.138), (20, -0.063), (21, 0.117), (22, -0.147), (23, -0.025), (24, 0.066), (25, 0.204), (26, -0.208), (27, -0.009), (28, -0.032), (29, 0.065), (30, -0.209), (31, 0.161), (32, -0.029), (33, 0.057), (34, 0.179), (35, -0.027), (36, 0.18), (37, -0.072), (38, 0.035), (39, -0.015), (40, -0.06), (41, 0.082), (42, -0.17), (43, 0.08), (44, 0.139), (45, 0.23), (46, -0.094), (47, 0.156), (48, 0.119), (49, 0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.92161155 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
2 0.41675806 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
3 0.37082547 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
4 0.34810245 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
5 0.34494737 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
6 0.33836561 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
7 0.31461796 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.29930449 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
9 0.28417563 92 jmlr-2008-Universal Multi-Task Kernels
10 0.27166063 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
11 0.26090163 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
12 0.25885499 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
13 0.23846439 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
14 0.23550345 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
15 0.23461576 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
16 0.23206303 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
17 0.23079692 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
18 0.22097856 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
19 0.20608622 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
20 0.18826756 57 jmlr-2008-Manifold Learning: The Price of Normalization
topicId topicWeight
[(0, 0.019), (5, 0.023), (31, 0.021), (40, 0.039), (54, 0.06), (58, 0.054), (66, 0.069), (76, 0.035), (88, 0.06), (91, 0.393), (92, 0.058), (94, 0.047), (99, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.72042984 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
2 0.69782788 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
Author: Giorgio Corani, Marco Zaffalon
Abstract: In this paper, the naive credal classifier, which is a set-valued counterpart of naive Bayes, is extended to a general and flexible treatment of incomplete data, yielding a new classifier called naive credal classifier 2 (NCC2). The new classifier delivers classifications that are reliable even in the presence of small sample sizes and missing values. Extensive empirical evaluations show that, by issuing set-valued classifications, NCC2 is able to isolate and properly deal with instances that are hard to classify (on which naive Bayes accuracy drops considerably), and to perform as well as naive Bayes on the other instances. The experiments point to a general problem: they show that with missing values, empirical evaluations may not reliably estimate the accuracy of a traditional classifier, such as naive Bayes. This phenomenon adds even more value to the robust approach to classification implemented by NCC2. Keywords: naive Bayes, naive credal classifier, imprecise probabilities, missing values, conservative inference rule, missing at random
3 0.35679376 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
4 0.34179452 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.33939648 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
6 0.33520654 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.33103457 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
8 0.32857305 83 jmlr-2008-Robust Submodular Observation Selection
9 0.32780442 86 jmlr-2008-SimpleMKL
10 0.32583895 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.32300633 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
12 0.32287911 57 jmlr-2008-Manifold Learning: The Price of Normalization
13 0.32104394 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
14 0.31831884 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
16 0.31688184 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
17 0.31674019 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
18 0.3155075 9 jmlr-2008-Active Learning by Spherical Subdivision
19 0.31541249 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
20 0.31520975 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers