jmlr jmlr2008 jmlr2008-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution over both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumptions made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue from which fixed-structured hierarchical models suffer. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we propose an integrated web data extraction paradigm with hierarchical models. [sent-9, score-0.81]
2 We apply DHMRFs to a real-world web data extraction task. [sent-15, score-0.508]
3 Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue. [sent-17, score-1.012]
4 However, existing approaches use highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. [sent-26, score-0.629]
5 This paper first proposes an integrated web data extraction paradigm with hierarchical Markov Random Fields, and then addresses the blocky artifact issue (Irving et al. [sent-27, score-0.933]
6 However, how to extract product information from webpages generated by many (maybe tens of thousands of) different templates is non-trivial. [sent-38, score-0.376]
7 List pages are webpages containing several structured data records, and detail pages are webpages containing only detailed information about a single object. [sent-48, score-0.369]
8 However, it is highly ineffective to use decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. [sent-56, score-0.629]
9 The reasons for this are as follows. Error Propagation: since record detection and attribute labeling are two separate phases, errors in record detection will be propagated to attribute labeling. [sent-57, score-1.01]
10 more effective record detection algorithms should take into account the semantic labels of the text, but existing methods (Zhai and Liu, 2005; Lerman et al. [sent-68, score-0.403]
11 To address the above problems, the first part of this paper proposes an integrated web data extraction paradigm. [sent-80, score-0.63]
12 Specifically, we take a vision-tree representation of webpages and define both record detection and attribute labeling as assigning semantic labels to the nodes on the trees. [sent-81, score-0.906]
13 Then, we can define the integrated web data extraction that performs record detection and attribute labeling simultaneously. [sent-82, score-1.201]
14 Based on the tree representation, we define a simple integrated web data extraction model—Hierarchical Conditional Random Fields (HCRFs), whose structures are determined by vision-trees. [sent-83, score-0.671]
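To make the tree representation concrete, the following is a minimal sketch, not the paper's implementation: the node fields and the "Data_Record" read-off convention are illustrative assumptions. It shows how a single integrated labeling over a vision-tree yields both detected records and their attribute values.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VisionTreeNode:
    text: str = ""                                   # content of a leaf block
    children: List["VisionTreeNode"] = field(default_factory=list)
    label: Optional[str] = None                      # e.g., "Data_Record", "Name", "Price"

def collect_records(root: VisionTreeNode) -> List[dict]:
    """Read an integrated extraction result off a labeled vision-tree:
    a node labeled 'Data_Record' marks a detected record, and labeled
    leaves below it carry the record's attribute values."""
    records: List[dict] = []

    def visit(node: VisionTreeNode, current: Optional[dict]) -> None:
        if node.label == "Data_Record":
            current = {}                             # a new record starts here
            records.append(current)
        if not node.children and current is not None and node.label not in (None, "Data_Record"):
            current.setdefault(node.label, []).append(node.text)
        for child in node.children:
            visit(child, current)

    visit(root, None)
    return records
```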
15 Thus, effective web data extraction models should have the capability to adapt their structures during the inference process. [sent-94, score-0.539]
16 But in the scenario of web data, sharing CPTs can be difficult because the hierarchical structures are not as regular as the dyadic or quad trees in image processing. [sent-103, score-0.552]
17 The performance of our models is demonstrated on a web data extraction task—product information extraction. [sent-117, score-0.539]
18 Section 3 presents an integrated web data extraction paradigm and fixed-structured Hierarchical Conditional Random Fields. [sent-123, score-0.63]
19 Preliminary Background Knowledge. The background knowledge on which the following work is based comes from web data extraction and statistical hierarchical modeling. [sent-130, score-0.688]
20 1 Web Data Extraction. Web data extraction is an information extraction (IE) task that identifies information of interest in webpages. [sent-133, score-0.476]
21 Web data extraction differs from traditional IE in that various types of structural dependencies exist between HTML elements. [sent-134, score-0.553]
22 For example, the HTML tag tree is itself hierarchical and each webpage is displayed as a two-dimensional image to readers. [sent-135, score-0.434]
23 This paper explores both hierarchical and two-dimensional spatial information for more effective web data extraction. [sent-139, score-0.45]
24 They take in some manually labeled webpages and learn some extraction rules (or wrappers). [sent-142, score-0.393]
25 Since the learned wrappers can only be used to extract data from similar pages, maintaining the wrappers as web sites change requires substantial effort. [sent-143, score-0.348]
26 2 Statistical Hierarchical Modeling. Multi-scale or hierarchical statistical modeling has shown great promise in image labeling (Kato et al. [sent-167, score-0.382]
27 For example, fixed models in image processing often lead to the blocky artifact issue, and a similar problem arises in web data extraction due to the diversity of web data. [sent-181, score-1.002]
28 Integrated Web Data Extraction. In this section, we formally define integrated web data extraction and propose Hierarchical Conditional Random Fields (HCRFs) to perform this task. [sent-201, score-0.63]
29 A good representation can make the extraction task easier and improve extraction accuracy. [sent-204, score-0.476]
30 2 Record Detection and Attribute Labeling. Based on the definition of a vision-tree, we now formally define the concepts of record detection and attribute labeling. [sent-227, score-0.439]
31 Figure 3: (a) Partial vision-tree of the webpage in Figure 1(a); (b) An HCRF model with linear-chain neighborhood between sibling nodes; (c) Another HCRF model with 2D neighborhood between sibling nodes and between nodes that share a grand-parent. [sent-228, score-0.369]
32 Definition 3.1 (Record detection): Given a vision-tree, record detection is the task of locating the root of a minimal subtree that contains the content of a record. [sent-234, score-0.331]
33 Definition 3.2 (Attribute labeling): For each identified record, attribute labeling is the task of assigning attribute labels to the leaf blocks (elements) within the record. [sent-240, score-0.603]
34 We can build a complete model to extract both records and attributes by sequentially combining existing record detection and attribute labeling algorithms. [sent-241, score-0.772]
35 Therefore, we propose an integrated approach that conducts simultaneous record detection and attribute labeling. [sent-243, score-0.699]
36 3 Integrated Web Data Extraction. Based on the above definitions, both record detection and attribute labeling are tasks of assigning labels to blocks of the vision-tree of a webpage. [sent-245, score-0.696]
37 Formally, we define the integrated web data extraction as follows. [sent-247, score-0.63]
38 The goal of web data extraction is to find the label assignment y with maximum posterior probability, $y = \arg\max_{y} p(y|x)$, and to extract data from this assignment. [sent-255, score-0.601]
39 4 Hierarchical Conditional Random Fields. In this section, we first introduce some basics of Conditional Random Fields and then propose Hierarchical Conditional Random Fields for integrated web data extraction. [sent-257, score-0.392]
40 Then, $T = \bigcup_{d=1}^{L-1} \{(i, j, l) : 0 \le i, j < N_d,\; 0 \le l < N_{d-1},\; n_{ij} = 1,\; s_{il} = 1,\; s_{jl} = 1\}$ is the set of triangles in the graph G. [sent-288, score-0.356]
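As a hedged illustration of this definition, the sketch below enumerates the triangle set from level-wise indicator arrays; the argument names (L for the number of levels, N for per-level node counts, n and s for the indicator variables) are assumptions made for the code, not the paper's notation.

```python
def triangles(L, N, n, s):
    """Enumerate T: n[d][i][j] is the neighbor indicator between nodes i
    and j at level d, and s[d][i][l] indicates that node i at level d
    attaches to candidate parent l at level d - 1; N[d] counts the
    candidate nodes at level d."""
    T = []
    for d in range(1, L):                      # d = 1, ..., L - 1
        for i in range(N[d]):
            for j in range(N[d]):
                for l in range(N[d - 1]):
                    if n[d][i][j] and s[d][i][l] and s[d][j][l]:
                        T.append((d, i, j, l))
    return T
```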
41 Then, hierarchical statistical modeling is the task of constructing an appropriate hierarchical model structure and carrying out inference about the labels of given observations. [sent-316, score-0.411]
42 We will give an example of web data extraction in the experiment section. [sent-318, score-0.508]
43 This assumption holds in applications such as web data extraction and image processing. [sent-348, score-0.578]
44 With the structure model, the first part of the energy when the system is at state (s, y) is $E_1(s, y, x) = \sum_k \mu_k \sum_{ijl} s_{il}\, s_{jl}\, n_{ij}\, g_k(i, j, l, x)$, where a triple (i, j, l) denotes a particular position in the dynamic model. [sent-353, score-0.668]
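Read literally, this energy is a weighted sum of feature values over active triangles. The following sketch transcribes it under the assumption that the triangle list (for example, the output of the enumerator above) already contains exactly the positions where all three indicators equal 1; mu and g are illustrative names.

```python
def energy_E1(mu, active_triangles, g):
    """E1(s, y, x) = sum_k mu_k * sum_{ijl} s_il s_jl n_ij g_k(i, j, l, x),
    where active_triangles lists the (d, i, j, l) positions with
    s_il = s_jl = n_ij = 1 and g(k, d, i, j, l) plays the role of g_k."""
    return sum(mu[k] * g(k, d, i, j, l)
               for k in range(len(mu))
               for (d, i, j, l) in active_triangles)
```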
45 For example, in web data extraction only the labels of leaf nodes are observable and both the hierarchical structures and the labels of inner nodes are hidden. [sent-386, score-1.041]
46 Let $q(s, y_h | y_e, x)$ be an approximation of the true distribution $p(s, y_h | y_e, x)$. [sent-391, score-0.416]
47 With a slight abuse of notation, we will use $q(s, y_h)$ to denote $q(s, y_h | y_e, x)$. [sent-392, score-0.416]
48 Substituting $p(s, y_h | y_e, x) = p(s, y_h, y_e | x) / p(y_e | x)$ into the above equation and using the non-negativity of the KL divergence, we get a lower bound on the log-likelihood: $\log p(y_e | x) \ge \sum_{s, y_h} q(s, y_h) [\log p(s, y_h, y_e | x) - \log q(s, y_h)]$. [sent-396, score-1.228]
49 Equivalently, $\mathcal{L}(\Theta) = \sum_{s, y_h} q(s, y_h) [\log q(s, y_h) - \log p(s, y_h, y_e | x)]$ is an upper bound on the negative log-likelihood $-L(\Theta)$. [sent-397, score-0.718]
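For completeness, the standard variational step behind these two bounds can be written out; the derivation below is a textbook identity stated in this paper's notation, not a verbatim quote from the paper.

```latex
% For any distribution q(s, y_h):
\log p(y_e \mid x)
  = \sum_{s, y_h} q(s, y_h)\,
      \log \frac{p(s, y_h, y_e \mid x)}{p(s, y_h \mid y_e, x)}
  \;\ge\; \sum_{s, y_h} q(s, y_h)\,
      \log \frac{p(s, y_h, y_e \mid x)}{q(s, y_h)},
% since the gap between the two sides is exactly
% KL( q(s, y_h) \,\|\, p(s, y_h \mid y_e, x) ) \ge 0,
% with equality iff q equals the true posterior.
```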
50 Similarly, the derivative of $\mathcal{L}(\Theta)$ with respect to $\mu_k$ is $\frac{\partial \mathcal{L}(\Theta)}{\partial \mu_k} = -\sum_{ijl} \langle n_{ij}\, s_{il}\, s_{jl} \rangle_{q(s, y_h)}\, g_k(i, j, l, x, \zeta) - \frac{\partial F_\infty}{\partial \mu_k}$. [sent-403, score-0.531]
51 Now, the upper bound is approximated by $\mathcal{L}(\Theta) = F_0 - F_\infty \approx F_0 - F_i \triangleq \mathrm{CF}_i^{\mathrm{App}} = \mathrm{KL}(q_0 \,\|\, p) - \mathrm{KL}(q_i \,\|\, p)$, where $q_0 = q(s, y_h)$ is optimized with the observable labels clamped to their values, and $q_i(s, y_h, y_e)$ is optimized with all variables free, starting from $q_0$. [sent-407, score-0.561]
52 Assume $q_0$ can be factorized as $q_0 = q(s, y_h) = q(s)\, q(y_h)$; then we get $\mathrm{KL}(q_0 \,\|\, p) = -\langle \log p(s, y_h, y_e | x) \rangle_{q_0} - H(q(s)) - H(q(y_h))$, (3) where $H(p) = -\langle \log p \rangle_p$ is the entropy of the distribution p. [sent-419, score-0.51]
53 Substituting the above distributions into (3) and keeping $q(y_h)$ fixed, we get $\mathrm{KL}(q_0 \,\|\, p) = -\langle \log p(s, y_h, y_e | x) \rangle_{q_0} - H(q(s)) + c$, where c is a constant. [sent-428, score-0.345]
54 Setting the derivative with respect to $\mu_{il}$ to zero, we get $\mu_{il} \propto \exp\big\{ \sum_k \mu_k\, s_{il} \sum_j \langle s_{jl} \rangle_{q(s)}\, n_{ij}\, g_k(i, j, l, x) + \sum_k \lambda_k\, s_{il} \sum_j \langle s_{jl} \rangle_{q(s)}\, n_{ij} \sum_{y_1, y_2, y_3} \langle \alpha_i^{y_1} \alpha_j^{y_2} \alpha_l^{y_3} \rangle_{q(y_h)}\, f_k(y_1, y_2, y_3, x, \zeta) \big\}$. [sent-429, score-0.749]
55 Setting the derivative with respect to $m_i^y$ to zero, we get $m_i^y \propto \exp\big\{ \sum_{jl} \big[ \langle n_{ij}\, s_{il}\, s_{jl} \rangle_{q(s)} \sum_{y_1, y_2} \langle \alpha_j^{y_1} \alpha_l^{y_2} \rangle_{q(y_h)}\, f_k(y, y_1, y_2, x, \zeta) + \langle n_{ij}\, s_{jl}\, s_{il} \rangle_{q(s)} \sum_{y_1, y_2} \langle \alpha^{y_1} \alpha^{y_2} \rangle_{q(y_h)}\, f_k(y_1, y, y_2, x, \zeta) \big] + \cdots \big\}$. [sent-432, score-0.712]
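The two fixed-point conditions suggest a coordinate-ascent loop that alternates the exponentiated updates and renormalizes each family of marginals. The sketch below is a hedged reading of that loop, not the paper's algorithm: it assumes dictionary-valued marginals, and structure_score and label_score are hypothetical callbacks standing in for the right-hand sides above.

```python
import math
from collections import defaultdict

def mean_field_updates(mu, alpha, structure_score, label_score, n_iters=50):
    """mu maps (i, l) to the structure marginal mu_il; alpha maps (i, y)
    to the label marginal alpha_i^y. Alternate the two exponentiated
    updates, renormalizing each family after its sweep."""
    for _ in range(n_iters):
        for (i, l) in list(mu):
            mu[(i, l)] = math.exp(structure_score(i, l, mu, alpha))
        normalize(mu)       # enforce sum_l mu_il = 1 for each child i
        for (i, y) in list(alpha):
            alpha[(i, y)] = math.exp(label_score(i, y, mu, alpha))
        normalize(alpha)    # enforce sum_y alpha_i^y = 1 for each node i
    return mu, alpha

def normalize(table):
    totals = defaultdict(float)
    for (i, _), v in table.items():
        totals[i] += v
    for (i, k) in list(table):
        table[(i, k)] /= totals[i]
```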
56 The first part is to evaluate the performance of integrated web data extraction models compared with existing decoupled methods. [sent-461, score-0.719]
57 All the experiments are carried out on a real-world web data extraction task—product information extraction. [sent-463, score-0.508]
58 Based on work by Zhai and Liu (2005), we try to align the elements of two adjacent blocks in the same page and extract some global features to help attribute labeling. [sent-519, score-0.36]
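A hedged sketch of this alignment step follows, using a generic sequence matcher in place of the partial alignment of Zhai and Liu (2005); representing elements by signatures such as tag paths is an assumption of the sketch.

```python
import difflib

def aligned_pairs(block_a, block_b):
    """block_a and block_b are lists of element signatures (e.g., tag
    paths) for two adjacent record blocks on the same page. Elements
    matched across the two blocks are likely to share the same attribute
    label, which can be turned into a global feature for labeling."""
    sm = difflib.SequenceMatcher(a=block_a, b=block_b)
    pairs = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs
```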
59 2 Label Spaces For variables at leaf nodes, we are interested in deciding whether a leaf block (an element) is an attribute value of the object we want to extract. [sent-529, score-0.348]
60 So, we have two types of label spaces—leaf label space for variables at leaf nodes and inner label space for variables at inner nodes. [sent-531, score-0.366]
61 All these labels are general to any web data extraction problem, and they are independent of any specific object type. [sent-539, score-0.559]
62 Object-type Dependent Labels: Between data record blocks and leaf blocks, there are intermediate blocks on a vision-tree. [sent-540, score-0.41]
63 The object-type dependent labels in product information extraction are listed in Table 2 with the format Con *. [sent-546, score-0.335]
64 For the training data, the detail pages are from 61 web sites and the list pages are from 81 web sites. [sent-554, score-0.692]
65 The number of web sites that are found in both list and detail training data is 39. [sent-555, score-0.422]
66 In total, 58 unique templates are present in the list training pages and 61 unique templates in the detail training pages. [sent-557, score-0.382]
67 For testing data, Table 3 shows the number of unique web sites where the pages come from and the number of different templates presented in these data. [sent-558, score-0.439]
68 For example, the pages in LDST are from 167 web sites, of which 78 are found in list training data and 52 are found in detail training data. [sent-559, score-0.386]
69 The number of web sites that are found in both list and detail training data is 34. [sent-560, score-0.422]
70 Thus, in total, 71 list-page web sites and 263 detail-page web sites are not seen in the training data. [sent-562, score-0.868]
71 For templates, 83 list-page templates and 208 detail-page templates are not seen in the training data. [sent-563, score-0.522]
72 In DDST, pages from different web sites typically have different templates and thus most templates have 1 document. [sent-566, score-0.572]
73 A block is considered as a correctly detected data record if it contains all the appeared attributes. Table 3 (statistics of the data sets): LDST: 167 web sites (78/52/34), 140 templates (57/0/0); DDST: 268 web sites (2/3/0), 212 templates (0/4/0). [sent-569, score-0.32]
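Given this correctness criterion, the record-detection precision and recall used in the evaluation can be computed as in the sketch below; the is_correct predicate is a hypothetical stand-in for the criterion just stated.

```python
def precision_recall(detected_blocks, gold_records, is_correct):
    """is_correct(block, record) should implement the stated criterion,
    i.e., the block covers everything the gold record contains."""
    matched_detected = sum(1 for b in detected_blocks
                           if any(is_correct(b, r) for r in gold_records))
    matched_gold = sum(1 for r in gold_records
                       if any(is_correct(b, r) for b in detected_blocks))
    precision = matched_detected / len(detected_blocks) if detected_blocks else 0.0
    recall = matched_gold / len(gold_records) if gold_records else 0.0
    return precision, recall
```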
74 Evaluation of Integrated Web Data Extraction Models. In this section, we report the evaluation results of integrated web data extraction models compared with decoupled models. [sent-576, score-0.719]
75 The results demonstrate that integrated extraction models can achieve significant improvements over decoupled models in both record detection and attribute labeling. [sent-577, score-0.919]
76 We also show the generalization ability of the integrated extraction models. [sent-578, score-0.36]
77 For the integrated extraction model, a webpage is first segmented by VIPS to construct a vision-tree and then HCRFs are used to detect both records and attributes on the vision-tree. [sent-584, score-0.616]
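The pipeline just described, as a hedged sketch: the segmenter, the inference routine, and the record read-off are passed in as hypothetical stand-ins rather than assumed to be the paper's actual APIs.

```python
def extract_page(html, vips_segment, hcrf_map_inference, collect_records):
    tree = vips_segment(html)        # webpage -> vision-tree (VIPS)
    hcrf_map_inference(tree)         # jointly label every node of the tree
    return collect_records(tree)     # detected records with their attributes
```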
78 To see the separate effect of our approach on record detection and attribute labeling, we first detect data records on the parsed vision-trees using the content, tree-distance, shape-distance, and type-distance features. [sent-598, score-0.65]
79 We use the same 200 list pages to train a 2D CRF model for extraction on list pages, and use the same 150 detail pages to train another 2D CRF model for extraction on detail pages. [sent-603, score-0.708]
80 In contrast, our approach integrates attribute labeling into block detection and can consider semantics while detecting data records. [sent-628, score-0.452]
81 When the semantics of the elements are not considered, H SNG and H S extract more noisy blocks than H NG or HCRF, so the precision of record detection decreases by 5. [sent-632, score-0.414]
82 Attribute labeling benefits from good-quality records: one reason for the better performance is that attribute labeling can build on the good results of record detection. [sent-771, score-0.603]
83 For example, if a detected record is not a data record or misses some important information such as Name, attribute labeling will fail to find the missed information or will find some incorrect information. [sent-772, score-0.712]
84 Global features help attribute labeling: another reason for the improvements in attribute labeling is the incorporation of the global features as in Section 5. [sent-775, score-0.48]
85 3 Generalization Ability. We report some empirical results to show the generalization ability of the integrated web data extraction models. [sent-795, score-0.63]
86 We randomly pick 37 templates from LDST and for each template we collect 5 webpages for training and 10 webpages for testing. [sent-796, score-0.443]
87 We can see that the integrated extraction models converge very quickly. [sent-800, score-0.391]
88 As the number of templates in the training data increases, the extraction accuracy becomes higher and the variances become smaller. [sent-801, score-0.371]
89 Results show that DHMRFs can (at least partially) overcome the blocky artifact issue in diverse web data extraction. [sent-809, score-0.393]
90 Instead, as structure selection is integrated with labeling in DHMRFs, the dynamic model can properly group the attributes of the same object, and at the same time separate the attributes of different objects with the help of semantic labels. [sent-879, score-0.551]
91 The weaker discriminative power of D-Trees causes additional errors in constructing model structures even in the non-intertwined cases, and thus hurts the accuracy of record detection and attribute labeling. [sent-900, score-0.439]
92 Although only a small number (e.g., 4) of templates in the testing data are seen in the training data, the results on webpages generated from unseen templates do not change much. [sent-1047, score-0.455]
93 The integrated HCRFs outperform the sequential HCRFs, which take record detection and attribute labeling as two separate steps as described in Section 6. [sent-1054, score-0.693]
94 Conclusions and Future Work. In this paper, we proposed an integrated web data extraction paradigm with hierarchical models. [sent-1095, score-0.81]
95 between variables, DHMRFs can potentially address the blocky artifact issue in diverse web data extraction. [sent-1100, score-0.393]
96 We apply the models to a real-world web data extraction task. [sent-1104, score-0.539]
97 Finally, extensive studies of the integrated extraction models in other complicated domains, like extracting researchers’ information (Zhu et al. [sent-1110, score-0.391]
98 ROADRUNNER: Towards automatic data extraction from large web sites. [sent-1152, score-0.508]
99 Simultaneous record detection and attribute labeling in web data extraction. [sent-1299, score-0.841]
100 Dynamic hierarchical Markov random fields and their application to web data extraction. [sent-1307, score-0.45]
wordName wordTfidf (topN-words)
[('dhmrfs', 0.297), ('web', 0.27), ('hcrf', 0.252), ('extraction', 0.238), ('yh', 0.208), ('record', 0.198), ('hierarchical', 0.18), ('hcrfs', 0.168), ('sil', 0.168), ('webpages', 0.155), ('attribute', 0.141), ('jl', 0.138), ('dynamic', 0.137), ('templates', 0.133), ('labeling', 0.132), ('integrated', 0.122), ('contrastive', 0.114), ('fields', 0.111), ('ields', 0.11), ('ldst', 0.107), ('records', 0.106), ('ierarchical', 0.103), ('detection', 0.1), ('zhai', 0.099), ('webpage', 0.097), ('ye', 0.094), ('arkov', 0.087), ('ddst', 0.084), ('zhu', 0.084), ('ie', 0.083), ('hu', 0.079), ('block', 0.079), ('andom', 0.079), ('depta', 0.076), ('desc', 0.076), ('nodes', 0.075), ('hang', 0.074), ('crf', 0.074), ('blocks', 0.074), ('image', 0.07), ('price', 0.07), ('page', 0.07), ('sng', 0.069), ('blocky', 0.065), ('leaf', 0.064), ('intertwined', 0.061), ('sibling', 0.061), ('divergence', 0.06), ('en', 0.06), ('detail', 0.059), ('crfs', 0.058), ('artifact', 0.058), ('decoupled', 0.058), ('list', 0.057), ('semantic', 0.054), ('name', 0.054), ('lerman', 0.053), ('attributes', 0.053), ('liu', 0.052), ('wake', 0.052), ('jun', 0.052), ('labels', 0.051), ('label', 0.051), ('ni', 0.05), ('tag', 0.046), ('gatterbauer', 0.046), ('nie', 0.046), ('wen', 0.046), ('zaiqing', 0.046), ('product', 0.046), ('markov', 0.045), ('dependencies', 0.045), ('yl', 0.044), ('detected', 0.043), ('il', 0.043), ('extract', 0.042), ('tree', 0.041), ('cpts', 0.039), ('font', 0.038), ('sleep', 0.038), ('welling', 0.038), ('gk', 0.037), ('bo', 0.037), ('inner', 0.037), ('sites', 0.036), ('node', 0.036), ('html', 0.035), ('unseen', 0.034), ('element', 0.034), ('content', 0.033), ('kl', 0.033), ('features', 0.033), ('sharing', 0.032), ('eld', 0.032), ('incorporate', 0.031), ('models', 0.031), ('china', 0.031), ('distance', 0.031), ('crawled', 0.03), ('irving', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution over both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumptions made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue from which fixed-structured hierarchical models suffer. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
2 0.095295817 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni
Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources
3 0.08171311 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
4 0.078332029 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
5 0.069945417 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
6 0.068297789 87 jmlr-2008-Stationary Features and Cat Detection
7 0.067318074 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.059676509 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
9 0.058759212 54 jmlr-2008-Learning to Select Features using their Properties
10 0.050598327 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations
12 0.049586419 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
13 0.048694253 52 jmlr-2008-Learning from Multiple Sources
14 0.04847889 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
15 0.04615961 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
16 0.046010077 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.043731894 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
18 0.043601841 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
19 0.0435686 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
20 0.042663451 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
topicId topicWeight
[(0, 0.204), (1, 0.026), (2, 0.051), (3, 0.013), (4, -0.194), (5, 0.081), (6, 0.087), (7, 0.035), (8, 0.012), (9, 0.012), (10, -0.092), (11, 0.201), (12, -0.173), (13, -0.014), (14, -0.087), (15, -0.285), (16, -0.08), (17, -0.127), (18, 0.094), (19, -0.068), (20, 0.091), (21, -0.016), (22, -0.025), (23, -0.067), (24, 0.153), (25, 0.07), (26, 0.023), (27, -0.093), (28, 0.061), (29, -0.027), (30, -0.023), (31, 0.177), (32, 0.035), (33, -0.078), (34, 0.003), (35, 0.07), (36, 0.104), (37, 0.096), (38, 0.043), (39, -0.011), (40, 0.053), (41, -0.063), (42, 0.081), (43, -0.043), (44, 0.018), (45, -0.048), (46, -0.138), (47, 0.02), (48, 0.04), (49, -0.16)]
simIndex simValue paperId paperTitle
same-paper 1 0.95211679 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution over both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumptions made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue from which fixed-structured hierarchical models suffer. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
2 0.44048318 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni
Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources
3 0.41998214 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
4 0.39674997 87 jmlr-2008-Stationary Features and Cat Detection
Author: François Fleuret, Donald Geman
Abstract: Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classifiers. The efficiency of this strategy for a complex pose, that is, for fine-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: (1) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; (2) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efficient, one-shot learning of pose-specific classifiers. To avoid expensive scene processing, we arrange these classifiers in a hierarchy based on nested partitions of the pose; as in previous work on coarse-to-fine search, this allows for efficient processing. The hierarchy is then "folded" for training: all the classifiers at each level are derived from one base predictor learned from all the data. The hierarchy is "unfolded" for testing: parsing a scene amounts to examining increasingly finer object descriptions only when there is sufficient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes. Keywords: supervised learning, computer vision, image interpretation, cats, stationary features, hierarchical search
5 0.39482537 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
6 0.36268571 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
7 0.34316263 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.30118582 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
9 0.28549618 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
10 0.26152593 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
11 0.26077962 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
12 0.25309238 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
14 0.23821518 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
15 0.23187506 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
16 0.21355884 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.21067883 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
18 0.20856014 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
19 0.19880751 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
20 0.18812586 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations
topicId topicWeight
[(0, 0.018), (5, 0.032), (9, 0.014), (10, 0.378), (24, 0.01), (31, 0.017), (40, 0.042), (54, 0.036), (58, 0.035), (66, 0.046), (76, 0.031), (78, 0.011), (88, 0.079), (92, 0.055), (94, 0.049), (99, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.77205694 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution over both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumptions made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue from which fixed-structured hierarchical models suffer. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
2 0.34501642 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
3 0.32710937 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy
Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics
5 0.3268787 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
6 0.32669464 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.32633689 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
8 0.32569316 57 jmlr-2008-Manifold Learning: The Price of Normalization
9 0.32566634 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
10 0.32500219 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
11 0.32471794 86 jmlr-2008-SimpleMKL
12 0.32182589 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
13 0.32152739 9 jmlr-2008-Active Learning by Spherical Subdivision
14 0.32145101 83 jmlr-2008-Robust Submodular Observation Selection
15 0.31887811 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
16 0.31850201 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
17 0.31769311 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.31735003 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
19 0.31700492 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
20 0.3145 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers