jmlr jmlr2008 jmlr2008-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gemma C. Garriga, Petra Kralj, Nada Lavrač
Abstract: Closed sets have been proven successful in the context of compacted data representation for association rule learning. However, their use is mainly descriptive, dealing only with unlabeled data. This paper shows that when considering labeled data, closed sets can be adapted for classification and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. We formally prove that these sets characterize the space of relevant combinations of features for discriminating the target class. In practice, identifying relevant/irrelevant combinations of features through closed sets is useful in many applications: to compact emerging patterns of typical descriptive mining applications, to reduce the number of essential rules in classification, and to efficiently learn subgroup descriptions, as demonstrated in real-life subgroup discovery experiments on a high dimensional microarray data set. Keywords: rule relevancy, closed sets, ROC space, emerging patterns, essential rules, subgroup discovery
Reference: text
sentIndex sentText sentNum sentScore
1 This paper shows that when considering labeled data, closed sets can be adapted for classification and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. [sent-10, score-0.289]
2 We formally prove that these sets characterize the space of relevant combinations of features for discriminating the target class. [sent-11, score-0.263]
3 Keywords: rule relevancy, closed sets, ROC space, emerging patterns, essential rules, subgroup discovery 1. [sent-13, score-0.542]
4 Introduction. Rule discovery in data mining mainly explores unlabeled data, and the focus is on finding itemsets that satisfy a minimum support constraint (namely, frequent itemsets) and, from them, constructing rules above a certain confidence. [sent-14, score-0.59]
5 Algorithms for contrast set mining include STUCCO (Bay and Pazzani, 2001) and an innovative approach presented in the form of mining emerging patterns (Dong and Li, 1999). [sent-30, score-0.291]
6 Indeed, we can see all these tasks on labeled data (learning classification rules, subgroup discovery, or contrast set mining) as a rule induction problem, that is, a process of searching a space of concept descriptions (hypotheses in the form of rule antecedents). [sent-36, score-0.281]
7 Searching for relevant descriptions for rule construction has been extensively addressed in descriptive data mining as well. [sent-42, score-0.295]
8 A useful insight was provided by closure systems (Carpineto and Romano, 2004; Ganter and Wille, 1998), aimed at compacting the whole space of descriptions into a reduced system of relevant sets that formally conveys the same information as the complete space. [sent-43, score-0.277]
9 The approach has successfully evolved towards mining closed itemsets (see, for example, Pasquier et al. [sent-44, score-0.515]
10 Intuitively, closed itemsets can be seen as maximal sets of items/features covering a maximal set of examples. [sent-46, score-0.459]
11 Despite its success in the data mining community, the use of closed sets is mainly descriptive. [sent-47, score-0.283]
12 To the best of our knowledge, the notion of closed sets has not yet been exported to labeled data, nor used in the learning tasks for labeled data described above. [sent-49, score-0.267]
13 (1999) and Lavrač and Gamberger (2005), we formally justify that the obtained closed sets characterize the space of relevant combinations of features for discriminating the target class. [sent-52, score-0.464]
14 In practice, our notion of closed sets in the labeled context (described in Sections 3 and 4) can be naturally interpreted as non-redundant descriptive rules (discriminating the target class) in the ROC space (Section 5). [sent-53, score-0.392]
15 2), and finally, to learn descriptions for subgroup discovery on potato microarray data (Section 6. [sent-57, score-0.341]
16 We consider two-class learning problems where the set of examples E is divided into positives (P, target-class examples identified by label +) and negatives (N, labeled by −), and E = P ∪ N. [sent-73, score-0.276]
17 Later, we will see that some combinations of features X ⊆ F produce more relevant antecedents than others for the rules X → +. [sent-77, score-0.276]
18 We will do it by integrating the notion of closed itemsets and the concept of feature relevancy proposed in previous works. [sent-79, score-0.554]
19 1 Closed Itemsets From the practical point of view of data mining algorithms, closed itemsets are the largest sets (w. [sent-81, score-0.515]
20 In the example of Table 2, the itemset corresponding to {Age=young} is not closed because it can be extended to the maximal set {Age=young, Astigmatism=no, Tear=normal} that has the same support in this data. [sent-90, score-0.303]
21 Notice that by treating positive examples separately, the positive label will be already implicit in the closed itemsets mined on the target class data. [sent-91, score-0.517]
22 [Table 1 residue: a flattened column of yes/no attribute values from the contact lens data (presumably the Astigmatism column) and the column header "Tear prod.".] [sent-97, score-0.708]
23 Id | Age | Spectacle prescription | Astigmatism | Tear prod. | Class
1 | young | myope | no | normal | +
2 | young | hypermetrope | no | normal | +
3 | pre-presbyopic | myope | no | normal | +
4 | pre-presbyopic | hypermetrope | no | normal | +
5 | presbyopic | hypermetrope | no | normal | +
(Astigmatism=no is assumed for every row, consistent with the closed set {Astigmatism=no, Tear=normal} covering all positive examples.) [sent-99, score-1.221]
24 Table 2: The set of positive examples when class soft of the contact lens data of Table 1 is selected as the target class. [sent-101, score-0.530]
25 constructing the closure system of items on our positive examples and using this system to study the structural properties of the closed sets to discriminate the implicit label. [sent-105, score-0.35]
26 Many efficient algorithms have been proposed for discovering closed itemsets over a certain minimum support threshold; see a compendium of them in Goethals and Zaki (2004). [sent-106, score-0.464]
27 The foundations of closed itemsets are based on the definition of a closure operator on a lattice of items (Carpineto and Romano, 2004; Ganter and Wille, 1998). [sent-107, score-0.646]
28 The standard closure operator Γ for items acts as follows: the closure Γ(X) of a set of items X ⊆ F includes all items that are present in all examples having all items in X. [sent-108, score-0.374]
29 From the formal point of view of Γ, closed sets are those coinciding with their closure, that is, for X ⊆ F, X is closed iff Γ(X) = X. [sent-110, score-0.429]
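To make the operator concrete, the following is a minimal Python sketch of Γ evaluated on the positive examples of Table 2; the names (closure, is_closed, POS) and the item-string encoding of features are ours, not the paper's.

```python
from typing import List, Set

def closure(X: Set[str], examples: List[Set[str]]) -> Set[str]:
    """Gamma(X): all items present in every example that contains all items of X."""
    covering = [e for e in examples if X <= e]
    if not covering:                      # empty cover: return all items by convention
        return set().union(*examples)
    out = set(covering[0])
    for e in covering[1:]:
        out &= e
    return out

def is_closed(X: Set[str], examples: List[Set[str]]) -> bool:
    return closure(X, examples) == set(X)

# Positive examples of Table 2 (class soft), encoded as sets of attribute=value items.
POS = [
    {"Age=young",          "Spectacle=myope",        "Astigmatism=no", "Tear=normal"},
    {"Age=young",          "Spectacle=hypermetrope", "Astigmatism=no", "Tear=normal"},
    {"Age=pre-presbyopic", "Spectacle=myope",        "Astigmatism=no", "Tear=normal"},
    {"Age=pre-presbyopic", "Spectacle=hypermetrope", "Astigmatism=no", "Tear=normal"},
    {"Age=presbyopic",     "Spectacle=hypermetrope", "Astigmatism=no", "Tear=normal"},
]

print(closure({"Age=young"}, POS))   # {Age=young, Astigmatism=no, Tear=normal}: not closed
print(is_closed({"Astigmatism=no", "Tear=normal"}, POS))   # True
```

The first call reproduces the earlier observation that {Age=young} is not closed, since its closure also contains Astigmatism=no and Tear=normal.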
30 Intensive work has focused on identifying which collection of generators is adequate to ensure that all closed sets can be produced. [sent-113, score-0.278]
31 Moreover, the closed sets formalized with the operator Γ are exactly those sets obtained in the closed set mining process defined above, which presents many advantages (see, for example, Balcázar and Baixeries, 2003; Crémilleux and Boulicaut, 2002). [sent-123, score-0.516]
32 Closed itemsets are lossless in the sense that they uniquely determine the set of all frequent itemsets and their exact support (cf. [sent-124, score-0.569]
33 set-theoretic inclusion) and there is no other intermediate closed itemset in the lattice. [sent-129, score-0.272]
34 Figure 1 shows the lattice of closed itemsets obtained from data from Table 2. [sent-132, score-0.497]
35 Notice that all closed itemsets with the same support cover a different subset of transactions of the original data. [sent-134, score-0.493]
36 In practice, such exponential lattices are not completely constructed, as only a list of closed itemsets over a certain minimum support suffices for practical purposes. [sent-135, score-0.464]
37 Therefore, instead of closed sets one needs to talk about frequent closed sets, that is, those closed sets over the minimum support constraint given by the user. [sent-136, score-0.736]
38 Also notice the difference of frequent closed sets from the popular concept of maximal frequent sets (see, for example, Tan et al. [sent-137, score-0.349]
39 Obviously, imposing a minimum support constraint will eliminate the largest closed sets whose support is typically very low. [sent-139, score-0.291]
40 In the following we consider a theoretical framework with all closed sets; in practice though, we will need a minimum support constraint to consider only the frequent ones. [sent-142, score-0.334]
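As an illustration only (not one of the efficient miners surveyed in Goethals and Zaki, 2004), frequent closed sets can be enumerated by brute force; the following Python sketch and its toy data are hypothetical.

```python
from itertools import combinations

def closure(X, examples):
    covering = [e for e in examples if X <= e]
    if not covering:
        return set().union(*examples)
    out = set(covering[0])
    for e in covering[1:]:
        out &= e
    return out

def support(X, examples):
    return sum(1 for e in examples if X <= e)

def frequent_closed_itemsets(examples, min_support):
    """Every closed set is the closure of some itemset, so enumerate the closures of all
    item combinations (exponential, only viable for tiny data) and keep the frequent ones."""
    items = sorted(set().union(*examples))
    found = {}
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            c = frozenset(closure(set(combo), examples))
            s = support(c, examples)
            if s >= min_support:
                found[c] = s
    return found  # maps closed itemset -> support

toy = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
print(frequent_closed_itemsets(toy, min_support=2))
# {frozenset({'b'}): 3, frozenset({'a', 'b'}): 2, frozenset({'b', 'c'}): 2}
```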
41 To illustrate this notion we take the data of Table 1: if examples of class none form our positives and the rest of examples are considered negative, then the feature Tear=reduced covers Age=young, hence making this last feature irrelevant for the discrimination of the class none. [sent-150, score-0.333]
42 Because all the true positives of f′ are also covered by f, it is true that TP(f′) = TP(f, f′); similarly, because all the false positives of f are also covered by f′, we have FP(f) = FP(f, f′). [sent-167, score-0.44]
43 This will provide the desired mapping from relevant sets of features to the lattice of closed itemsets constructed on target class examples. [sent-185, score-0.691]
44 Again, because we need to formalize our arguments on positive and negative examples separately, we will use Γ+ or Γ− for the closure of itemsets on P or N, respectively. [sent-186, score-0.343]
45 For example, when soft is the target class, the rule Spectacle=myope → + is not relevant because at least the rule {Astigmatism=no, Tear=normal} → + will be more relevant. [sent-207, score-0.267]
46 As all the true positives of Y are also covered by X, it is true that TP(Y) = TP(X ∪ Y); similarly, as all the false positives of X are also covered by Y, we have that FP(X) = FP(X ∪ Y). [sent-217, score-0.44]
47 In the language of rules, Lemma 6 implies that when a set of features X ⊆ F is more relevant than Y ⊆ F, then rule Y → + is less relevant than rule X → + for discriminating the target class. [sent-221, score-0.454]
48 Yet, the interestingness of Definition 5 is that we can use this new concept to study the relevancy of itemsets (discovered in the mining process) for discrimination problems. [sent-224, score-0.464]
49 Also, it can be immediately seen that if X is more relevant than Y in the positives, then Y will be more relevant than X in the negatives (by just reversing Definition 5). [sent-225, score-0.28]
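A small Python sketch of this relevancy check follows; it is written from Lemma 6 and the surrounding discussion rather than from the exact statement of Definition 5 (not reproduced in this summary), so the precise condition should be read as an assumption.

```python
def true_positives(X, P):
    """Indices of positive examples covered by itemset X."""
    return {i for i, e in enumerate(P) if X <= e}

def false_positives(X, N):
    """Indices of negative examples covered by itemset X."""
    return {i for i, e in enumerate(N) if X <= e}

def more_relevant(X, Y, P, N):
    """X is at least as relevant as Y for a rule 'antecedent -> +' when X covers all
    true positives of Y and no false positives beyond those of Y (assumed reading)."""
    return (true_positives(Y, P) <= true_positives(X, P)
            and false_positives(X, N) <= false_positives(Y, N))

# Tiny hypothetical data (not the lens example): the shorter antecedent {a} wins.
P = [{"a", "b"}, {"a", "c"}, {"a"}]
N = [{"a", "b"}, {"c"}]
print(more_relevant({"a"}, {"a", "b"}, P, N))   # True
```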
50 Next subsection characterizes the role of closed itemsets to find relevant sets of features for discrimination. [sent-226, score-0.589]
51 Closed Sets for Discrimination. Together with the result of Lemma 6, it can be shown that only the closed itemsets mined in the set of positive examples suffice for discrimination. [sent-230, score-0.479]
52 , 2004), frequent itemsets with a very small minimum support constraint are initially mined and subsequently post-processed in order to find the most suitable rules. [sent-238, score-0.499]
53 We are aware that some generators Y of a closed set X might be exactly equivalent to X in terms of TP and FP, thus forming equivalence classes of rules (i.e., rules covering exactly the same true and false positives). [sent-239, score-0.366]
54 The result of this theorem characterizes closed sets in the positives as those representatives of relevant rules; so, any set which is not closed can be discarded, and thus, efficient closed mining algorithms can be employed for discrimination purposes. [sent-242, score-0.992]
55 The new result presented here states that not all frequent itemsets are necessary: as shown in Theorem 7 only the closed sets have the potential to be relevant. [sent-245, score-0.507]
56 However, Theorem 7 simply states that those itemsets which are not closed in the set of positive examples cannot form a relevant rule to discriminate the target class, thus they do not correspond to a relevant combination of features. [sent-249, score-0.724]
57 In other words, closed itemsets suffice but some of them might not be necessary to discriminate the target class. [sent-250, score-0.471]
58 It might well be that a closed itemset is irrelevant with respect to another closed itemset in the system. [sent-251, score-0.58]
59 The next section is dedicated to the task of reducing the closure system of itemsets to characterize the final space of relevant sets of features. [sent-254, score-0.448]
60 Characterizing the Space of Relevant Sets of Features This section studies how the dual closure system on the negative examples is used to reduce the lattice of closed sets on the positives. [sent-256, score-0.376]
61 This reduction will characterize a complete space of relevant sets of features for discriminating the target class. [sent-257, score-0.263]
62 Remark 8 Given two different closed sets on the positives X and X′ such that X ⊄ X′ and X′ ⊄ X (i.e., neither contains the other). [sent-259, score-0.374]
63 Remark 9 Given two closed sets on the positives X and X′ with X ⊂ X′, we have by construction that TP(X′) ⊂ TP(X) and FP(X′) ⊆ FP(X) (from Proposition 2). [sent-264, score-0.374]
64 Notice that because X and X′ are different closed sets in the positives, TP(X′) is necessarily a proper subset of TP(X); however, regarding the coverage of false positives, this inclusion is not necessarily proper. [sent-265, score-0.27]
65 To illustrate Remark 9 we use the lattice of closed itemsets in Figure 1. [sent-266, score-0.497]
66 By construction the closed set {Spectacle=myope, Astigmatism=no, Tear=normal} from Figure 1 covers fewer positives than the proper predecessor {Astigmatism=no, Tear=normal}. [sent-267, score-0.407]
67 Remark 9 points out that two different closed sets in the positives, yet being one included in the other, may end up covering exactly the same set of false positives. [sent-270, score-0.269]
68 Theorem 10 Let X ⊆ F and X′ ⊆ F be two different closed sets in the positives such that X ⊂ X′. [sent-275, score-0.374]
69 Thus, by Theorem 10 we can reduce the closure system constructed on the positives by discarding irrelevant nodes: if two closed itemsets are connected by an ascending/descending path on the lattice of positives (i.e., one is included in the other) and they cover exactly the same false positives, then the more specific one can be discarded. [sent-283, score-0.99]
70 Finally, after Theorem 7 and Theorem 10, we can characterize the space of relevant sets of features for discriminating the selected target class as follows. [sent-288, score-0.263]
71 Definition 11 (Space of relevant sets of features) The space of relevant combinations of features for discriminating the target class is defined as those sets X for which it holds that: Γ+(X) = X and there is no other closed set X′ = Γ+(X′) with X′ ⊂ X such that Γ−(X′) = Γ−(X). [sent-289, score-0.569]
72 The four closed sets forming the space of relevant sets of features for the class soft are shown in Table 3. [sent-294, score-0.395]
73 The space of relevant combinations defines exhaustively all the relevant antecedents for discriminating the target class. [sent-299, score-0.349]
74 Shortest Representation of a Relevant Set. Based on Theorem 7 we know that the generators Y of a closed set X are characterized as covering exactly the same positive examples and at least the same negative examples. [sent-305, score-0.307]
75 However, we have FP(X) ⊆ FP(Y ) for Y generator of X; so, it might happen that some generators Y are equivalent to their closed set X in that they cover exactly the same true positives and also the same false positives. [sent-312, score-0.575]
76 Therefore, to check equivalence it is only necessary to check whether a generator covers the same false positives as its closure. [sent-316, score-0.432]
77 For example, this may depend on a minimum-length criterion of the final classification rules: a generator Y equivalent to a closed set X satisfies by construction that Y ⊂ X, so Y → + is shorter than the rule X → +. [sent-319, score-0.297]
78 Then, the minimal equivalent generators of a closed itemset X naturally correspond to the minimal representation of the relevant rule X → +. [sent-320, score-0.497]
79 It is well-known that the minimal generators of a closed set X can be computed by traversing the hypergraph of differences between X and its proper predecessors in the system (see, for example, Pfaltz and Taylor, 2002). [sent-327, score-0.278]
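For small closed sets, a brute-force alternative to the hypergraph traversal is enough to illustrate the idea; the Python sketch below returns the minimum-cardinality generators, which is what the minimum-length criterion above asks for (names and toy data are ours).

```python
from itertools import combinations

def closure(X, examples):
    covering = [e for e in examples if X <= e]
    if not covering:
        return set().union(*examples)
    out = set(covering[0])
    for e in covering[1:]:
        out &= e
    return out

def shortest_generators(X, examples):
    """Smallest subsets Y of the closed set X whose closure is X (shortest rule bodies)."""
    X = set(X)
    for k in range(len(X) + 1):
        gens = [set(c) for c in combinations(sorted(X), k)
                if closure(set(c), examples) == X]
        if gens:
            return gens
    return [X]

toy = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
print(shortest_generators({"a", "b"}, toy))   # [{'a'}]: 'a' alone already implies 'b'
```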
80 The ROC space is appropriate for measuring the quality of rules, since rules with the best covering properties are placed in the top left corner, while rules whose distribution of covered positives and negatives is similar to the distribution in the entire data set lie close to the main diagonal. [sent-331, score-0.559]
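A minimal sketch of how a rule's position in ROC space is obtained from its covering properties (the helper name roc_point and the toy data are ours):

```python
def roc_point(X, P, N):
    """Position of the rule 'X -> +' in ROC space as (FPr, TPr).
    Good rules lie toward the top-left corner; rules near the diagonal FPr == TPr cover
    positives and negatives in roughly the proportions of the whole data set."""
    tpr = sum(1 for e in P if X <= e) / len(P)
    fpr = sum(1 for e in N if X <= e) / len(N)
    return fpr, tpr

P = [{"a", "b"}, {"a"}]
N = [{"b"}, {"c"}]
print(roc_point({"a"}, P, N))   # (0.0, 1.0): the ideal top-left corner
```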
81 The first phase mines the set S = {X1, . . . , Xn} of frequent closed itemsets from the target class (Theorem 7). [sent-372, score-0.545]
82 Schematically, for any closed set Xi ∈ S, if there exists another closed set X j ∈ S such that both have the same support in the negatives and X j ⊂ Xi , then Xi is removed. [sent-378, score-0.503]
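The second phase just described can be sketched in a few lines of Python; the paper's RelSets implementation is assumed to realize it more efficiently than this quadratic scan.

```python
def negative_support(X, N):
    return sum(1 for e in N if X <= e)

def reduce_to_relevant(S, N):
    """Drop Xi when some strictly smaller Xj in S has the same support on the negatives
    (Xj is then at least as relevant as Xi)."""
    S = [frozenset(x) for x in S]
    kept = []
    for Xi in S:
        dominated = any(Xj < Xi and negative_support(Xj, N) == negative_support(Xi, N)
                        for Xj in S)
        if not dominated:
            kept.append(Xi)
    return kept

S = [{"a"}, {"a", "b"}]          # e.g., closed sets mined from the positives
N = [{"a", "b", "c"}]            # one negative example
print(reduce_to_relevant(S, N))  # [frozenset({'a'})]: {'a','b'} is removed
```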
83 As discussed above, the minimum support constraint in the first phase will tend to prune overly long closed sets, and this might have an impact on the application. [sent-452, score-0.26]
84 Still, we find it important to point out that the notion of relevancy explored in this paper typically prefers the shortest closed sets. [sent-457, score-0.322]
85 More specifically, EPs are itemsets whose growth rates (the ratio of the pattern's support in one class to its support in the other, that is, TPr/FPr) are larger than a user-specified threshold. [sent-464, score-0.292]
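A direct transcription of this definition into Python (the threshold name rho and the toy data are illustrative):

```python
def growth_rate(X, P, N):
    """Growth rate of itemset X toward the target class: TPr / FPr (inf when FPr == 0)."""
    tp = sum(1 for e in P if X <= e)
    fp = sum(1 for e in N if X <= e)
    if fp == 0:
        return float("inf")
    return (tp * len(N)) / (fp * len(P))

def emerging_patterns(candidates, P, N, rho):
    """Itemsets whose growth rate is at least the user-specified threshold rho."""
    return [X for X in candidates if growth_rate(X, P, N) >= rho]

P = [{"a", "b"}, {"a"}]
N = [{"a", "b"}, {"b"}, {"c"}]
print(growth_rate({"a"}, P, N))   # 3.0 (TPr = 1, FPr = 1/3)
```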
86 When data is structurally redundant, compression factors are higher since many frequent sets are redundant with respect to the closed sets. [sent-473, score-0.299]
87 This latter work also identifies condensed representations of EPs from closed sets mined in the whole database. [sent-477, score-0.273]
88 Technically, they correspond to mining all frequent itemsets and removing those sets X such that there exists another frequent Y with Y ⊂ X and having both the same support in positives and negatives. [sent-483, score-0.666]
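A sketch of this pruning criterion, analogous to the reduction used for relevant closed sets but applied to all frequent itemsets and to both classes at once (names and toy data are ours):

```python
def essential_itemsets(S, P, N):
    """Drop X when some other set Y in S with Y a proper subset of X has the same
    support in the positives and the same support in the negatives."""
    def supp(X, E):
        return sum(1 for e in E if X <= e)
    S = [frozenset(x) for x in S]
    return [X for X in S
            if not any(Y < X and supp(Y, P) == supp(X, P) and supp(Y, N) == supp(X, N)
                       for Y in S)]

P = [{"a", "b"}, {"a", "b", "c"}]
N = [{"b"}]
print(essential_itemsets([{"a"}, {"a", "b"}, {"b"}], P, N))
# [frozenset({'a'}), frozenset({'b'})]: {'a','b'} adds nothing over {'a'}
```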
89 Note that essential rules are not pruned by a growth rate threshold, which is why their number is usually higher than the number of emerging patterns shown in the previous subsection. [sent-486, score-0.268]
90 (2004) that microarray data analysis problems can also be approached through subgroup discovery, where the goal is to find a set of subgroup descriptions (a rule set) for the target class that preferably has a low number of rules while each rule has high coverage and accuracy (Lavrač et al. [sent-543, score-0.57]
91 We used the RelSets algorithm to analyze the differences between gene expression levels characteristic for virus sensitive potato transgenic lines, discriminating them from virus resistant potato transgenic lines and vice versa. [sent-557, score-0.534]
92 Rule relevancy filtering according to Definition 5 reduced the rules to just one relevant rule, with a 100% true positive rate and a 0% false positive rate, for each class. [sent-560, score-0.399]
93 However, the selected gene expression levels after 12 hours and the comparison of gene expression differences (12−8 hours) characterize the resistance to potato virus infection for the transgenic lines tested. [sent-568, score-0.339]
94 Conclusions. We have presented a theoretical framework that, based on the covering properties of closed itemsets, characterizes those sets of features that are relevant for discrimination. [sent-573, score-0.383]
95 We call them closed sets for labeled data, since they keep similar structural properties of classical closed sets, yet taking into account the positive and negative labels of examples. [sent-574, score-0.435]
96 In practice the approach shows major advantages for compacting emerging patterns and essential rules and solving hard subgroup discovery problems. [sent-578, score-0.423]
97 The application to potato microarray data, where the goal was to find differences between virus resistant and virus sensitive potato transgenic lines, shows that our approach is not only fast, but also returns a small set of rules that are meaningful and easy to interpret by domain experts. [sent-580, score-0.507]
98 Future work will be devoted to adapting efficient algorithms of emerging patterns by Dong and Li (1999) for the discovery of the presented relevant sets. [sent-581, score-0.287]
99 Mining minimal non-redundant association rules using frequent closed itemsets. [sent-613, score-0.388]
100 Application of closed itemset mining for class labeled data in functional genomics. [sent-769, score-0.387]
wordName wordTfidf (topN-words)
[('tear', 0.296), ('fp', 0.28), ('astigmatism', 0.266), ('tp', 0.266), ('lavra', 0.251), ('itemsets', 0.232), ('supp', 0.213), ('myope', 0.205), ('closed', 0.201), ('positives', 0.173), ('spectacle', 0.159), ('hypermetrope', 0.152), ('relsets', 0.129), ('subgroup', 0.129), ('young', 0.128), ('relevancy', 0.121), ('closure', 0.111), ('age', 0.111), ('eps', 0.109), ('gamberger', 0.106), ('relevant', 0.105), ('presbyopic', 0.099), ('arriga', 0.091), ('ralj', 0.091), ('emerging', 0.09), ('rules', 0.088), ('avra', 0.084), ('potato', 0.084), ('mining', 0.082), ('normal', 0.081), ('generators', 0.077), ('losed', 0.076), ('frequent', 0.074), ('itemset', 0.071), ('negatives', 0.07), ('discriminating', 0.069), ('abeled', 0.064), ('lattice', 0.064), ('none', 0.062), ('yes', 0.059), ('virus', 0.058), ('roc', 0.058), ('ets', 0.058), ('discovery', 0.055), ('transgenic', 0.053), ('zaki', 0.053), ('generator', 0.053), ('features', 0.051), ('bastide', 0.046), ('boulicaut', 0.046), ('infection', 0.046), ('mined', 0.046), ('pasquier', 0.046), ('rule', 0.043), ('sd', 0.042), ('resistant', 0.042), ('false', 0.042), ('microarray', 0.04), ('ltered', 0.039), ('items', 0.038), ('cf', 0.038), ('extensivity', 0.038), ('kralj', 0.038), ('taouil', 0.038), ('soft', 0.038), ('target', 0.038), ('dong', 0.037), ('patterns', 0.037), ('irrelevant', 0.036), ('fpr', 0.035), ('covers', 0.033), ('descriptions', 0.033), ('labeled', 0.033), ('gene', 0.033), ('hours', 0.032), ('antecedents', 0.032), ('milleux', 0.032), ('descriptive', 0.032), ('support', 0.031), ('garriga', 0.03), ('kav', 0.03), ('lenses', 0.03), ('soulet', 0.03), ('discrimination', 0.029), ('growth', 0.029), ('ljubljana', 0.029), ('cover', 0.029), ('constraint', 0.028), ('reduced', 0.028), ('coverage', 0.027), ('iff', 0.027), ('condensed', 0.026), ('lens', 0.026), ('covered', 0.026), ('covering', 0.026), ('association', 0.025), ('compression', 0.024), ('essential', 0.024), ('slovenia', 0.023), ('contact', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 22 jmlr-2008-Closed Sets for Labeled Data
Author: Gemma C. Garriga, Petra Kralj, Nada Lavrač
Abstract: Closed sets have been proven successful in the context of compacted data representation for association rule learning. However, their use is mainly descriptive, dealing only with unlabeled data. This paper shows that when considering labeled data, closed sets can be adapted for classification and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. We formally prove that these sets characterize the space of relevant combinations of features for discriminating the target class. In practice, identifying relevant/irrelevant combinations of features through closed sets is useful in many applications: to compact emerging patterns of typical descriptive mining applications, to reduce the number of essential rules in classification, and to efficiently learn subgroup descriptions, as demonstrated in real-life subgroup discovery experiments on a high dimensional microarray data set. Keywords: rule relevancy, closed sets, ROC space, emerging patterns, essential rules, subgroup discovery
2 0.24257721 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
3 0.049803484 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
4 0.035697468 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
Author: Xing Sun, Andrew B. Nobel
Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance
5 0.033037826 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
6 0.030735658 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
7 0.030359503 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
8 0.030174311 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
9 0.028949218 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.028488513 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
11 0.027654946 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.026533982 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.024208937 87 jmlr-2008-Stationary Features and Cat Detection
14 0.023519954 9 jmlr-2008-Active Learning by Spherical Subdivision
15 0.021317208 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
16 0.021131141 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
17 0.020335266 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
18 0.02023185 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
19 0.019322854 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
20 0.018581919 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
topicId topicWeight
[(0, 0.113), (1, -0.012), (2, -0.018), (3, -0.049), (4, 0.041), (5, 0.01), (6, 0.094), (7, -0.093), (8, 0.138), (9, 0.236), (10, -0.278), (11, -0.281), (12, -0.321), (13, -0.126), (14, -0.002), (15, 0.241), (16, 0.019), (17, 0.067), (18, -0.166), (19, -0.086), (20, 0.255), (21, -0.023), (22, 0.025), (23, -0.031), (24, 0.057), (25, 0.074), (26, 0.034), (27, 0.113), (28, 0.068), (29, -0.062), (30, 0.074), (31, 0.064), (32, 0.128), (33, -0.102), (34, 0.041), (35, 0.051), (36, 0.131), (37, -0.09), (38, 0.038), (39, 0.061), (40, 0.078), (41, -0.063), (42, 0.095), (43, -0.019), (44, 0.066), (45, 0.013), (46, 0.048), (47, 0.004), (48, -0.021), (49, -0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.96378815 22 jmlr-2008-Closed Sets for Labeled Data
Author: Gemma C. Garriga, Petra Kralj, Nada Lavrač
Abstract: Closed sets have been proven successful in the context of compacted data representation for association rule learning. However, their use is mainly descriptive, dealing only with unlabeled data. This paper shows that when considering labeled data, closed sets can be adapted for classification and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. We formally prove that these sets characterize the space of relevant combinations of features for discriminating the target class. In practice, identifying relevant/irrelevant combinations of features through closed sets is useful in many applications: to compact emerging patterns of typical descriptive mining applications, to reduce the number of essential rules in classification, and to efficiently learn subgroup descriptions, as demonstrated in real-life subgroup discovery experiments on a high dimensional microarray data set. Keywords: rule relevancy, closed sets, ROC space, emerging patterns, essential rules, subgroup discovery
2 0.78202528 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
3 0.15781598 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
Author: Xing Sun, Andrew B. Nobel
Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance
4 0.14424913 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
5 0.13031124 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
6 0.1160245 54 jmlr-2008-Learning to Select Features using their Properties
7 0.1077687 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
8 0.10635126 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
9 0.10507406 87 jmlr-2008-Stationary Features and Cat Detection
10 0.10387394 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
11 0.10096633 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
12 0.096159793 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
13 0.094752252 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.089988038 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
15 0.089281633 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
16 0.086815245 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
17 0.086471505 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
18 0.08462245 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
19 0.082866274 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
20 0.08090046 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
topicId topicWeight
[(0, 0.02), (5, 0.024), (40, 0.041), (54, 0.021), (58, 0.027), (66, 0.03), (76, 0.056), (86, 0.533), (88, 0.058), (92, 0.03), (94, 0.03), (99, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.81961346 22 jmlr-2008-Closed Sets for Labeled Data
Author: Gemma C. Garriga, Petra Kralj, Nada Lavrač
Abstract: Closed sets have been proven successful in the context of compacted data representation for association rule learning. However, their use is mainly descriptive, dealing only with unlabeled data. This paper shows that when considering labeled data, closed sets can be adapted for classification and discrimination purposes by conveniently contrasting covering properties on positive and negative examples. We formally prove that these sets characterize the space of relevant combinations of features for discriminating the target class. In practice, identifying relevant/irrelevant combinations of features through closed sets is useful in many applications: to compact emerging patterns of typical descriptive mining applications, to reduce the number of essential rules in classification, and to efficiently learn subgroup descriptions, as demonstrated in real-life subgroup discovery experiments on a high dimensional microarray data set. Keywords: rule relevancy, closed sets, ROC space, emerging patterns, essential rules, subgroup discovery
2 0.19121705 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
Author: Xing Sun, Andrew B. Nobel
Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance
3 0.19120537 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
4 0.18951397 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm improves sensitively accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains on the directed acyclic graphs (DAG) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γm n ), where γm < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m and sparse structures ˜ allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; significance level of the tests controls its sparseness, enabling to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still enables to significantly outperform other heuristic searches. Keywords: subset Bayesian networks, structure learning, optimal search, super-structure, connected
5 0.18789548 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
6 0.18655294 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
8 0.18283416 58 jmlr-2008-Max-margin Classification of Data with Absent Features
9 0.18198553 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
10 0.18177496 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
11 0.17899391 57 jmlr-2008-Manifold Learning: The Price of Normalization
12 0.17895657 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
13 0.17778586 56 jmlr-2008-Magic Moments for Structured Output Prediction
14 0.17676744 83 jmlr-2008-Robust Submodular Observation Selection
15 0.17660594 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
16 0.17507261 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
17 0.17352447 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
18 0.17334181 9 jmlr-2008-Active Learning by Spherical Subdivision
19 0.17269881 86 jmlr-2008-SimpleMKL
20 0.17262882 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration