nips nips2009 nips2009-71 knowledge-graph by maker-knowledge-mining

71 nips-2009-Distribution-Calibrated Hierarchical Classification


Source: pdf

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. [sent-2, score-0.41]

2 We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. [sent-3, score-0.995]

3 We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. [sent-5, score-0.357]

4 In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification. [sent-8, score-0.807]

5 Examples of popular label taxonomies are the ODP taxonomy of web pages [2], the gene ontology [6], and the LCC ontology of book topics [1]. [sent-13, score-1.154]

6 A taxonomy is a hierarchical structure over labels, where some labels define very general concepts, and other labels define more specific specializations of those general concepts. [sent-14, score-1.035]

7 A taxonomy of document topics could include the labels MUSIC, CLASSICAL MUSIC, and POPULAR MUSIC, where the last two are special cases of the first. [sent-15, score-0.794]

8 Some label taxonomies form trees (each label has a single parent) while others form directed acyclic graphs. [sent-16, score-0.563]

9 When a label taxonomy is given alongside a training set, the multiclass classification problem is often called a hierarchical classification problem. [sent-17, score-1.122]

10 The label taxonomy defines a structure over the multiclass problem, and this structure should be used both in the formal definition of the hierarchical classification problem, and in the design of learning algorithms to solve this problem. [sent-18, score-1.154]

11 Most hierarchical classification learning algorithms treat the taxonomy as an indisputable definitive model of the world, never questioning its accuracy. [sent-19, score-0.835]

12 However, most taxonomies are authored by human editors and subjective matters of style and taste play a major role in their design. [sent-20, score-0.194]

13 Many arbitrary decisions go into the design of a taxonomy, and when multiple editors are involved, these arbitrary decisions are made inconsistently. [sent-21, score-0.223]

14 Arbitrary decisions that go into the taxonomy design can have a significant influence on the outcome of the learning algorithm [19]. [sent-23, score-0.752]

15 The arbitrary factor in popular label taxonomies is a well-known phenomenon. [sent-25, score-0.375]

16 [17] gives the example of the Library of Congress Classification system (LCC), a widely adopted and constantly updated taxonomy of “all knowledge”, which includes the category WORLD HISTORY and four of its direct subcategories: ASIA, AFRICA, NETHERLANDS, and BALKAN PENINSULA. [sent-26, score-0.651]

17 The Dewey Decimal Classification (DDC), another widely accepted taxonomy of “all knowledge”, defines ten main classes, each of which has exactly ten subclasses, and each of those again has exactly ten subclasses. [sent-28, score-0.735]

18 The ODP taxonomy of web-page topics is optimized for navigability rather than informativeness, and is therefore very flat and often unbalanced. [sent-31, score-0.651]

19 As a result, two of the direct children of the label GAMES are VIDEO GAMES (with over 42,000 websites listed) and PAPER AND PENCIL GAMES (with only 32 websites). [sent-32, score-0.239]

20 These examples are not intended to show that these useful taxonomies are flawed; they merely demonstrate the arbitrary, subjective aspect of their design. [sent-33, score-0.236]

21 Some older approaches to hierarchical classification do not use the taxonomy in the definition of the classification problem [12, 13, 18, 9, 16]. [sent-35, score-0.835]

22 Namely, these approaches consider all classification mistakes to be equally bad, and use the taxonomy only to the extent that it reduces computational complexity and the number of classification mistakes. [sent-36, score-0.724]

23 When this interpretation of the taxonomy can be made, ignoring it is effectively wasting a valuable signal in the problem input. [sent-38, score-0.651]

24 For example, [8] define the loss of predicting a label u when the correct label is y as the number of edges along the path between the two labels in the taxonomy graph. [sent-39, score-1.359]
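To make the graph-distance loss of [8] concrete, here is a minimal Python sketch (the toy taxonomy and helper names are illustrative, not from the paper) that counts the edges on the path between a predicted label and the true label in a tree-shaped taxonomy.

```python
# Toy tree taxonomy: each label maps to its parent; "ALL" is the root.
parent = {
    "MUSIC": "ALL", "CLASSICAL": "MUSIC", "POPULAR": "MUSIC",
    "JAZZ": "MUSIC", "COUNTRY": "POPULAR",
}

def ancestors(label):
    """Return the chain [label, parent, ..., "ALL"] from label up to the root."""
    chain = [label]
    while chain[-1] != "ALL":
        chain.append(parent[chain[-1]])
    return chain

def graph_distance_loss(u, y):
    """Number of edges on the path between labels u and y (the loss of [8])."""
    up, yp = ancestors(u), ancestors(y)
    lca = next(a for a in up if a in yp)      # lowest common ancestor
    return up.index(lca) + yp.index(lca)      # edges up from u plus edges up from y

print(graph_distance_loss("COUNTRY", "JAZZ"))  # 3: COUNTRY -> POPULAR -> MUSIC -> JAZZ
```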

25 Additionally, a taxonomy provides a very natural framework for balancing the tradeoff between specificity and accuracy in classification. [sent-40, score-0.651]

26 Ideally, we would like our classifier to assign the most specific label possible to an instance, and the loss function should reward it adequately for doing so. [sent-41, score-0.321]

27 However, when a specific label cannot be assigned with sufficiently high confidence, it is often better to fall back on a more general correct label than it is to assign an incorrect specific label. [sent-42, score-0.402]

28 For example, classifying a document on JAZZ as the broader topic MUSIC is better than classifying it as the more specific yet incorrect topic COUNTRY MUSIC. [sent-43, score-0.211]

29 The graph-distance based loss function introduced by [8] captures both of the ideas mentioned above, but it is very sensitive to arbitrary choices that go into the taxonomy design. [sent-45, score-0.844]

30 We can make the difference between the two outcomes arbitrarily large by making some regions of the taxonomy very deep and other regions very flat. [sent-48, score-0.651]

31 Additionally, we note that the simple graph-distance based loss works best when the taxonomy is balanced, namely, when all of the splits in the taxonomy convey roughly the same amount of information. [sent-49, score-1.466]

32 If the correct label is NON-VIVALDI and our classifier predicts the more general label CLASSICAL MUSIC, the loss should be small, since the two labels are essentially equivalent. [sent-52, score-0.622]

33 On the other hand, if the correct label is VIVALDI then predicting CLASSICAL MUSIC should incur a larger loss, since important detail was excluded. [sent-53, score-0.239]

34 On the other hand, we don’t want arbitrary choices and unbalanced splits in the taxonomy to have a significant effect on the outcome. [sent-56, score-0.834]

35 Our proposed solution is to leave the taxonomy structure as-is, and to stick with a graph-distance based loss, but to introduce non-uniform edge weights. [sent-58, score-0.704]

36 Namely, the loss of predicting u when the true label is y is defined as the sum of edge-weights along the shortest path from u to y. [sent-59, score-0.407]

37 We use the underlying distribution over labels to set the edge weights in a way that adds balance to the taxonomy and compensates for certain arbitrary design choices. [sent-60, score-0.354] [sent-62, score-0.725]

38 Figure 1: Two equally-reasonable label taxonomies. Note the subjective decision to include/exclude the label ROCK, and note the unbalanced split of CLASSICAL to the small class VIVALDI and the much larger class NON-VIVALDI. [sent-61, score-0.308]

40 The weight of an edge between a label u and its parent u′ is the negative log-probability of observing the label u given that the example is also labeled by u′. [sent-64, score-0.564]
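A minimal sketch of this weighted loss, under the assumption (not stated verbatim in the excerpt) that p(u) for an internal label denotes the total probability mass of its subtree, so that each edge carries weight −log Pr(u | π(u)) = log p(π(u)) − log p(u); the toy taxonomy, the numbers, and the helper names are illustrative only.

```python
import math

# Toy taxonomy and (assumed) subtree probabilities; internal labels carry the
# summed mass of their descendants, and the root "ALL" has probability 1.
parent = {"CLASSICAL": "ALL", "POPULAR": "ALL",
          "VIVALDI": "CLASSICAL", "NON-VIVALDI": "CLASSICAL"}
p = {"ALL": 1.0, "CLASSICAL": 0.5, "POPULAR": 0.5,
     "VIVALDI": 0.01, "NON-VIVALDI": 0.49}

def ancestors(label):
    chain = [label]
    while chain[-1] != "ALL":
        chain.append(parent[chain[-1]])
    return chain

def calibrated_loss(u, y):
    """Sum of edge weights -log Pr(child | parent) along the path from u to y."""
    up, yp = ancestors(u), ancestors(y)
    lca = next(a for a in up if a in yp)
    # The product of conditionals along each leg telescopes to p(label) / p(lca).
    return (math.log(p[lca]) - math.log(p[u])) + (math.log(p[lca]) - math.log(p[y]))

# Backing off to CLASSICAL is nearly free when the truth is NON-VIVALDI,
# but expensive when the truth is the rare label VIVALDI.
print(calibrated_loss("CLASSICAL", "NON-VIVALDI"))  # ~0.02
print(calibrated_loss("CLASSICAL", "VIVALDI"))      # ~3.91
```

This matches the behavior asked for around Figure 1: the near-redundant split of CLASSICAL into VIVALDI and NON-VIVALDI costs almost nothing to ignore, while collapsing a genuinely informative split incurs a large penalty.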

41 A binomial-type estimator essentially counts things (such as mistakes), while a geometric-type estimator measures the amount of time that passes before something occurs. [sent-71, score-0.308]
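The "geometric-type" idea can be made concrete with a classical identity (our illustration, not the paper's exact construction): if G is the position of the first occurrence of an event of probability q in an i.i.d. stream, then the partial harmonic sum 1/1 + 1/2 + ... + 1/(G−1) is an unbiased estimator of ln(1/q). You learn log(1/q) by measuring how long you wait, which is also why such an estimator can "fail" when the event never shows up in the available sample.

```python
import math
import random

def harmonic_log_estimate(stream, event):
    """Return sum_{k=1}^{G-1} 1/k, where G is the index of the first sample in
    `stream` satisfying `event`; its expectation is ln(1/q) with q = Pr(event)."""
    total = 0.0
    for k, x in enumerate(stream, start=1):
        if event(x):
            return total
        total += 1.0 / k
    return total  # truncated ("fails") if the event never occurs in the stream

random.seed(0)
q = 0.05
estimates = [
    harmonic_log_estimate((random.random() for _ in range(100_000)), lambda x: x < q)
    for _ in range(20_000)
]
print(sum(estimates) / len(estimates))  # close to ln(1/0.05) ~ 3.0
print(math.log(1 / q))
```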

42 Since empirical estimation is the basis of supervised machine learning, we can now extrapolate hierarchical learning algorithms from our unbiased estimation technique. [sent-74, score-0.378]

43 Specifically, we present a reduction from hierarchical classification to cost-sensitive multiclass classification, which is based on our new geometric-type estimator. [sent-75, score-0.314]

44 Let X be an instance space and let T be a taxonomy of labels. [sent-87, score-0.651]

45 T is formally defined as the pair (U, π), where U is a finite set of labels and π is the function that specifies the parent of each label in U. [sent-89, score-0.452]

46 Specifically, we assume that U contains the special label ALL , and that all other labels in U are special cases of ALL . [sent-91, score-0.301]

47 π : U → U is a function that defines the structure of the taxonomy by assigning a parent π(u) to each label u ∈ U. [sent-92, score-0.961]

48 Semantically, π(u) is a more general label than u that contains u as a special case. [sent-93, score-0.201]

49 The ancestor function π⋆ maps each label to its set of ancestors, and is defined as π⋆(u) = ∪_{n=0}^{∞} {π^n(u)}. [sent-106, score-0.382]

50 The inverse of the ancestor function is the descendent function τ , which maps u ∈ U to the subset {u′ ∈ U : u ∈ π ⋆ (u′ )}. [sent-109, score-0.247]

51 In other words, u is a descendent of u′ if and only if u′ is an ancestor of u. [sent-110, score-0.247]

52 The lowest common ancestor function λ : U × U → U maps any pair of labels to their lowest common ancestor in the taxonomy, where “lowest” is in the sense of tree depth. [sent-114, score-0.532]

53 In words, λ(u, u′) is the closest ancestor of u that is also an ancestor of u′. [sent-116, score-0.362]

54 The leaves of a taxonomy are the labels that are not parents of any other labels. [sent-118, score-0.786]
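The formal objects just defined translate directly into code. The following minimal sketch (toy taxonomy and helper names are ours, not the paper's) implements the pair (U, π) together with the ancestor function π⋆, the descendant function τ, the lowest common ancestor λ, and the leaves.

```python
# Taxonomy (U, pi): every label except the root "ALL" has a parent.
PARENT = {"MUSIC": "ALL", "CLASSICAL": "MUSIC", "POPULAR": "MUSIC",
          "VIVALDI": "CLASSICAL", "NON-VIVALDI": "CLASSICAL"}
U = set(PARENT) | {"ALL"}

def pi_star(u):
    """Ancestor function: the set {pi^n(u) : n >= 0}, i.e. u and all its ancestors."""
    out = {u}
    while u != "ALL":
        u = PARENT[u]
        out.add(u)
    return out

def tau(u):
    """Descendant function: all labels v with u in pi_star(v)."""
    return {v for v in U if u in pi_star(v)}

def lca(u, v):
    """Lowest common ancestor: the common ancestor of maximal depth."""
    common = pi_star(u) & pi_star(v)
    return max(common, key=lambda a: len(pi_star(a)))  # |pi_star(a)| grows with depth

def leaves():
    """Labels that are not the parent of any other label."""
    return U - set(PARENT.values())

print(lca("VIVALDI", "POPULAR"))   # MUSIC
print(sorted(leaves()))            # ['NON-VIVALDI', 'POPULAR', 'VIVALDI']
```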

55 Note that we assume that the labels that occur in the distribution are always leaves of the taxonomy T . [sent-122, score-0.786]

56 More formally, for each label u ∈ U \ Y, we add a new node y to U with π(y) = u, and whenever we sample (x, u) from D, we replace it with (x, y). [sent-124, score-0.201]

57 A classifier is a function f : X → U that assigns a label to each instance of X. [sent-127, score-0.201]

58 Note that a classifier is allowed to predict any label in U, even though it knows that only leaf labels are ever observed in the real world. [sent-128, score-0.301]

59 We feel that this property captures a fundamental characteristic of hierarchical classification: although the truth is always specific, a good hierarchical classifier will fall back to a more general label when it cannot confidently give a specific prediction. [sent-129, score-0.569]

60 For any instance-label pair (x, y), the loss ℓ(f (x), y) should be interpreted as the penalty associated with predicting the label f (x) when the true label is y. [sent-131, score-0.56]

61 Another fundamental characteristic of hierarchical classification problems is that not all prediction errors are equally bad, and the definition of the loss should reflect this. [sent-134, score-0.342]

62 3 A Distribution-Calibrated Loss for Hierarchical Classification As mentioned above, we want to calibrate the hierarchical classification loss function using the distribution D, through its empirical proxy S. [sent-136, score-0.409]

63 In other words, we want D to differentiate between informative splits in the taxonomy and redundant ones. [sent-137, score-0.747]

64 We follow [8] in using graph-distance to define the loss function, but instead of setting all of the edge weights to 1, we define edge weights using D. [sent-138, score-0.226]

65 For each y ∈ Y, let p(y) be the marginal probability of the label y in the distribution D. [sent-139, score-0.201]

66 Since this loss function (Eq. 1) depends only on u, y, and λ(u, y), and their frequencies according to D, it is completely invariant to the number of labels along the path between u and y. [sent-146, score-0.296]
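For concreteness, here is one consistent reading of the calibrated loss (our reconstruction, assuming p(u) for an internal label is the probability mass of its subtree): summing the edge weights −log Pr(child | parent) along the two legs of the path telescopes into a closed form that depends only on p(u), p(y), and p(λ(u, y)).

```latex
\ell(u, y)
  \;=\; \sum_{\text{edges on } u \to \lambda(u,y)} -\log \Pr(\text{child} \mid \text{parent})
  \;+\; \sum_{\text{edges on } y \to \lambda(u,y)} -\log \Pr(\text{child} \mid \text{parent})
  \;=\; \log\frac{p(\lambda(u,y))}{p(u)} \;+\; \log\frac{p(\lambda(u,y))}{p(y)} .
```

Written out, this equals 2 log p(λ(u, y)) − log p(u) − log p(y); the −log p(Y) part, in expectation over (X, Y) ∼ D, is the Shannon entropy H(Y), which is consistent with the remark below that the constant term in the risk is H(Y).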

67 It is also invariant to inconsistent degrees of flatness of the taxonomy in different regions. [sent-147, score-0.679]

68 Now, define the risk of a classifier f as R(f) = E_{(X,Y)∼D}[ℓ(f(X), Y)], the expected loss over examples sampled from D. [sent-152, score-0.239]

69 However, before we tackle the problem of finding a low risk classifier, we address the intermediate task of estimating the risk of a given classifier f using the sample S. [sent-154, score-0.238]

70 First, we write the definition of risk more explicitly using the definition of the loss function in Eq. (1). [sent-159, score-0.239]

71 Also define r(f, u) = Pr(λ(f (X), Y ) = u), the probability that the lowest common ancestor of f (X) and Y is u, when (X, Y ) is drawn from D. [sent-162, score-0.216]

72 This constant is simply H(Y ), the Shannon entropy [7] of the label distribution. [sent-165, score-0.201]

73 If f outputs a label u with p(u) = 0, then L1 will fail. (Footnote 1: The interested reader is referred to the extensive literature on the closely related problem of estimating the entropy of a distribution from a finite sample.) [sent-186, score-0.23]

74 Formally, let β(f) = min_{u : q(f,u) > 0} p(u) be the smallest probability of any label that f outputs. [sent-189, score-0.201]

75 The estimator E[L1 | no-fail] is no longer an unbiased estimator of R(f), but the bias is small. [sent-192, score-0.43]

76 The estimator L1 suffers from an unnecessarily high variance because it typically uses a short prefix of the sample S and wastes the remaining examples. [sent-202, score-0.208]

77 We reduce the variance of the estimator by repeating the estimation multiple times, without reusing any sample points. [sent-207, score-0.244]

78 and therefore the aggregate estimator L = (1/t) ∑_{i=1}^{t} Li would also be an unbiased estimate of R(f). [sent-222, score-0.34]

79 Since L1, . . . , Lt are also independent, the variance of the aggregate estimator would be (1/t) Var(L1). [sent-225, score-0.272]
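A small sketch of this variance-reduction scheme (our illustration, built on the harmonic-sum estimator sketched earlier): scan the sample once, let each repetition consume a fresh segment that ends at the next occurrence of the event, and average the per-segment estimates.

```python
import math
import random

def repeated_harmonic_estimates(samples, event):
    """Scan the sample once; each repetition ends at the next occurrence of
    `event` and contributes 1/1 + ... + 1/(length-1) for its own segment."""
    estimates, partial, k = [], 0.0, 0
    for x in samples:
        k += 1
        if event(x):
            estimates.append(partial)   # one completed estimate of ln(1/q)
            partial, k = 0.0, 0         # start the next repetition afresh
        else:
            partial += 1.0 / k
    return estimates                    # a trailing incomplete segment is discarded

random.seed(1)
q = 0.05
samples = [random.random() for _ in range(200_000)]
ests = repeated_harmonic_estimates(samples, lambda x: x < q)
print(len(ests), sum(ests) / len(ests), math.log(1 / q))
# Averaging t independent repetitions divides the variance by roughly t.
```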

80 The precise definition of T should be handled with care, to ensure that the individual estimators remain independent and that the aggregate estimator maintains a small bias. [sent-234, score-0.279]

81 If L does not fail, define the aggregate estimator as L = ∑_{i=1}^{T} Li. [sent-241, score-0.218]

82 The probability of failure of the estimator L is at most e^{−lβ(f)}. [sent-244, score-0.202]

83 We note that a very similar theorem can be stated in the finite-sample case. (Figure 2, inputs: a training set S = {(x_i, y_i)}_{i=1}^{m} and a label taxonomy T.) [sent-248, score-0.929]

84 (Figure 2, excerpt) ..., (m − 1)} : y_{ψ(j)} ∈ τ(λ(u, y_i)); M(i, u) = 1/(b+1) + 1/(b+2) + ··· + 1/a; OUTPUT: M. Figure 2: A reduction from hierarchical multiclass to cost-sensitive multiclass. [sent-271, score-0.362]

85 The complication stems from the fact that we are estimating the risk of k classifiers simultaneously, and the failure of one estimator depends on the values of the other estimators. [sent-273, score-0.348]

86 To conduct a fair comparison of the k classifiers, let Li(fj) be the i'th unbiased estimator of R(fj), and redefine T = min_{j=1,...,k} Tj. [sent-286, score-0.276]

87 Moreover, the variance of each L(fj)/E[T] is Var(L(fj)/E[T]) = σ²/E[T], so the effective variance of our unbiased estimate decreases like 1/E[T], which is what we would expect. [sent-297, score-0.23]

88 More precisely, we describe a reduction from hierarchical classification to cost-sensitive multiclass classification. [sent-305, score-0.352]

89 This reduction is itself an algorithm whose input is a training set of m examples and a taxonomy over d labels, and whose output is a d × m matrix of non-negative reals, denoted by M . [sent-307, score-0.695]

90 Entry M (i, j) is the cost of classifying example i with label j. [sent-308, score-0.256]

91 This cost matrix, and the original training set, are given to a cost-aware multiclass learning algorithm, which attempts to find a classifier f with a small empirical loss ∑_{i=1}^{m} M(i, f(x_i)). [sent-309, score-0.206]
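A hedged sketch of the reduction's interface. The paper's Figure 2 fills the cost matrix with geometric-type estimates computed from a permutation of the sample; the sketch below instead uses a naive plug-in estimate of the calibrated loss from empirical label frequencies — exactly the kind of sample reuse the paper argues against — purely to show the shape of M and how a cost-aware learner consumes it. All names and numbers are illustrative.

```python
import math
from collections import Counter

# Toy taxonomy and a toy labeled sample (leaf labels only).
PARENT = {"MUSIC": "ALL", "CLASSICAL": "MUSIC", "POPULAR": "MUSIC",
          "VIVALDI": "CLASSICAL", "NON-VIVALDI": "CLASSICAL"}
U = sorted(set(PARENT) | {"ALL"})
ys = ["NON-VIVALDI"] * 49 + ["VIVALDI"] * 1 + ["POPULAR"] * 50

def pi_star(u):
    out = [u]
    while out[-1] != "ALL":
        out.append(PARENT[out[-1]])
    return out

counts, m = Counter(ys), len(ys)

def subtree_prob(u):
    """Plug-in estimate of p(u): empirical mass of leaves in the subtree of u."""
    return sum(c for y, c in counts.items() if u in pi_star(y)) / m

def plugin_loss(u, y):
    lca = next(a for a in pi_star(u) if a in pi_star(y))
    pl, pu, py = subtree_prob(lca), subtree_prob(u), subtree_prob(y)
    return math.log(pl / pu) + math.log(pl / py)

# M[i][j]: cost of predicting label U[j] on example i.
M = [[plugin_loss(u, y) for u in U] for y in ys]

def empirical_cost(predictions):
    """The quantity a cost-aware learner tries to minimize: sum_i M(i, f(x_i))."""
    return sum(M[i][U.index(u)] for i, u in enumerate(predictions))

print(empirical_cost(["CLASSICAL"] * m))  # backing off to CLASSICAL everywhere
print(empirical_cost(ys))                 # predicting the true leaves costs 0
```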

92 For example, a common approach to multiclass problems is to train a model f_u : X → R for each label u ∈ U and to define the classifier f(x) = arg max_{u∈U} f_u(x). [sent-310, score-0.515]

93 An SVM-flavored way to train a cost-sensitive classifier is to assume that the functions f_u live in a Hilbert space, and to minimize ∑_{u=1}^{d} ‖f_u‖² + C ∑_{i=1}^{m} ∑_{u ≠ y_i} [ M(i, u) + f_u(x_i) − f_{y_i}(x_i) ]_+ , (4) where C > 0 is a parameter and [α]_+ = max{0, α}. [sent-311, score-0.386]
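A sketch of evaluating the objective in Eq. (4) for linear models f_u(x) = w_u·x. This is our reconstruction of the garbled display: we assume the inner aggregation is a sum over u ≠ y_i, although a max-over-labels variant is also common in cost-sensitive multiclass SVMs.

```python
import numpy as np

def eq4_objective(W, X, y_idx, M, C=1.0):
    """W: d x p weights (one row per label), X: m x p features,
    y_idx: true label indices, M: m x d cost matrix."""
    F = X @ W.T                                     # F[i, u] = f_u(x_i)
    margins = M + F - F[np.arange(len(y_idx)), y_idx][:, None]
    margins[np.arange(len(y_idx)), y_idx] = 0.0     # exclude u = y_i from the sum
    hinge = np.maximum(margins, 0.0).sum()          # sum_i sum_{u != y_i} [.]_+
    return (W ** 2).sum() + C * hinge               # unstructured regularizer

rng = np.random.default_rng(0)
m, d, p = 6, 4, 3
W, X = rng.normal(size=(d, p)), rng.normal(size=(m, p))
y_idx = rng.integers(0, d, size=m)
M = rng.random((m, d)); M[np.arange(m), y_idx] = 0.0  # zero cost for the true label
print(eq4_objective(W, X, y_idx, M))
```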

94 Based on the analysis of the previous sections, it is easy to see that, for all i, M(i, f(x_i)) is an unbiased estimator of the risk R̄(f). [sent-315, score-0.395]

95 Therefore, (1/m) ∑_{i=1}^{m} M(i, f(x_i)) is also an unbiased estimator of R̄(f). [sent-320, score-0.276]

96 We profess that a rigorous analysis of the variance of this estimator is missing from this work. [sent-323, score-0.208]

97 One can refine Eq. (4) by replacing the unstructured regularizer ∑_{u=1}^{d} ‖f_u‖² with the structured regularizer ∑_{u=1}^{d} ‖f_u − f_{π(u)}‖², where π(u) is the parent label of u. [sent-329, score-0.3]
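The structured regularizer is a one-line change in the same linear setup. In the sketch below, parent_idx maps each label index to its parent's index, and the root is mapped to itself so its term vanishes (an implementation detail we are assuming).

```python
import numpy as np

def structured_reg(W, parent_idx):
    """sum_u ||w_u - w_{pi(u)}||^2, tying each label's weights to its parent's."""
    return ((W - W[parent_idx]) ** 2).sum()

W = np.zeros((4, 3))                   # 4 labels, 3 features
parent_idx = np.array([0, 0, 1, 1])    # label 0 is the root (its own parent)
print(structured_reg(W, parent_idx))   # 0.0
```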

98 As a consequence, our focus was on the fundamental aspects of the hierarchical problem definition, rather than on the equally important algorithmic issues. [sent-332, score-0.222]

99 As mentioned in the introduction, the label taxonomy provides the perfect mechanism for backing off and predicting a more common and less risky ancestor of that label. [sent-338, score-1.071]

100 Acknowledgment The author would like to thank Paul Bennett for his suggestion to analyze the loss function for its information-theoretic properties, its reduction to a tree-weighted distance, and its ability to capture other desirable characteristics of hierarchical loss functions, such as weak monotonicity. [sent-343, score-0.468]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('taxonomy', 0.651), ('label', 0.201), ('hierarchical', 0.184), ('ancestor', 0.181), ('fj', 0.161), ('estimator', 0.154), ('classi', 0.154), ('vivaldi', 0.154), ('taxonomies', 0.132), ('unbiased', 0.122), ('loss', 0.12), ('risk', 0.119), ('fu', 0.114), ('parent', 0.109), ('labels', 0.1), ('music', 0.095), ('multiclass', 0.086), ('er', 0.078), ('var', 0.074), ('ontology', 0.071), ('descendent', 0.066), ('aggregate', 0.064), ('subjective', 0.062), ('estimators', 0.061), ('rooted', 0.059), ('hierarchy', 0.058), ('bad', 0.056), ('classifying', 0.055), ('variance', 0.054), ('edge', 0.053), ('calibrate', 0.053), ('want', 0.052), ('nition', 0.051), ('li', 0.05), ('words', 0.05), ('ers', 0.05), ('dekel', 0.049), ('yi', 0.048), ('path', 0.048), ('failure', 0.048), ('cation', 0.048), ('classical', 0.047), ('unbalanced', 0.045), ('reduction', 0.044), ('splits', 0.044), ('xi', 0.044), ('balkan', 0.044), ('ddc', 0.044), ('decimal', 0.044), ('fyi', 0.044), ('landing', 0.044), ('odp', 0.044), ('strip', 0.044), ('document', 0.043), ('non', 0.043), ('subtree', 0.043), ('formally', 0.042), ('arbitrary', 0.042), ('costsensitive', 0.038), ('lcc', 0.038), ('websites', 0.038), ('predicting', 0.038), ('equally', 0.038), ('decisions', 0.038), ('dependencies', 0.036), ('estimation', 0.036), ('namely', 0.036), ('mistakes', 0.035), ('thing', 0.035), ('jazz', 0.035), ('minu', 0.035), ('lowest', 0.035), ('leaves', 0.035), ('rare', 0.035), ('structured', 0.034), ('care', 0.034), ('unstructured', 0.033), ('gentile', 0.033), ('completeness', 0.033), ('congress', 0.033), ('estimations', 0.033), ('design', 0.032), ('go', 0.031), ('pr', 0.03), ('regularizer', 0.03), ('subclasses', 0.03), ('games', 0.029), ('theorem', 0.029), ('ne', 0.029), ('topic', 0.029), ('fail', 0.029), ('acyclic', 0.029), ('semantically', 0.029), ('gene', 0.028), ('bi', 0.028), ('invariant', 0.028), ('ten', 0.028), ('stems', 0.027), ('holds', 0.027), ('lt', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 71 nips-2009-Distribution-Calibrated Hierarchical Classification

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

2 0.10860839 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

3 0.097997263 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

4 0.091493994 157 nips-2009-Multi-Label Prediction via Compressed Sensing

Author: John Langford, Tong Zhang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider multi-label prediction problems with large output spaces under the assumption of output sparsity – that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. 1

5 0.091488078 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

6 0.09127669 122 nips-2009-Label Selection on Graphs

7 0.091129124 72 nips-2009-Distribution Matching for Transduction

8 0.091025479 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

9 0.085973628 47 nips-2009-Boosting with Spatial Regularization

10 0.085202947 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

11 0.078780137 260 nips-2009-Zero-shot Learning with Semantic Output Codes

12 0.067266703 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

13 0.064936996 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

14 0.064308219 129 nips-2009-Learning a Small Mixture of Trees

15 0.06413839 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

16 0.063047491 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

17 0.062674291 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank

18 0.061730184 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

19 0.061239813 87 nips-2009-Exponential Family Graph Matching and Ranking

20 0.059890412 230 nips-2009-Statistical Consistency of Top-k Ranking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.21), (1, 0.07), (2, -0.047), (3, 0.008), (4, -0.011), (5, -0.038), (6, -0.102), (7, -0.052), (8, -0.032), (9, 0.071), (10, 0.024), (11, 0.002), (12, -0.047), (13, -0.064), (14, 0.116), (15, 0.082), (16, 0.026), (17, -0.051), (18, 0.13), (19, 0.024), (20, 0.103), (21, -0.037), (22, -0.051), (23, 0.016), (24, 0.015), (25, 0.016), (26, 0.001), (27, 0.014), (28, 0.057), (29, -0.089), (30, 0.021), (31, -0.002), (32, 0.003), (33, 0.126), (34, -0.045), (35, 0.012), (36, 0.13), (37, 0.057), (38, 0.005), (39, 0.004), (40, -0.005), (41, 0.062), (42, -0.059), (43, 0.018), (44, -0.066), (45, -0.011), (46, 0.025), (47, 0.005), (48, -0.015), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95506483 71 nips-2009-Distribution-Calibrated Hierarchical Classification

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

2 0.70313275 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

Author: Massih Amini, Nicolas Usunier, Cyril Goutte

Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off. 1

3 0.63908726 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

4 0.61149162 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets. 1

5 0.60851532 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han

Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1

6 0.60805368 72 nips-2009-Distribution Matching for Transduction

7 0.60609996 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

8 0.59764272 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

9 0.5887112 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

10 0.56458986 157 nips-2009-Multi-Label Prediction via Compressed Sensing

11 0.5532189 47 nips-2009-Boosting with Spatial Regularization

12 0.54402399 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

13 0.54036212 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

14 0.53378415 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

15 0.53337246 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition

16 0.5226301 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

17 0.52114886 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

18 0.51822311 193 nips-2009-Potential-Based Agnostic Boosting

19 0.50534838 94 nips-2009-Fast Learning from Non-i.i.d. Observations

20 0.5038178 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.022), (24, 0.046), (25, 0.095), (35, 0.05), (36, 0.121), (39, 0.051), (52, 0.203), (58, 0.085), (61, 0.045), (71, 0.073), (86, 0.095), (91, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83988798 71 nips-2009-Distribution-Calibrated Hierarchical Classification

Author: Ofer Dekel

Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

2 0.73202407 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

3 0.73138338 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

4 0.72975981 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

Author: Ruslan Salakhutdinov

Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.

5 0.72827035 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

Author: Kai Yu, Tong Zhang, Yihong Gong

Abstract: This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1

6 0.72794843 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

7 0.72613066 113 nips-2009-Improving Existing Fault Recovery Policies

8 0.72458875 260 nips-2009-Zero-shot Learning with Semantic Output Codes

9 0.7211619 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

10 0.72039437 72 nips-2009-Distribution Matching for Transduction

11 0.72038555 112 nips-2009-Human Rademacher Complexity

12 0.72008133 129 nips-2009-Learning a Small Mixture of Trees

13 0.71961421 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

14 0.71902633 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

15 0.71870255 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

16 0.71852738 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

17 0.71785098 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

18 0.7176435 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

19 0.71726686 137 nips-2009-Learning transport operators for image manifolds

20 0.71707684 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes