nips nips2004 nips2004-37 knowledge-graph by maker-knowledge-mining

37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice


Source: pdf

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. [sent-13, score-0.203]

2 It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. [sent-14, score-0.058]

3 In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. [sent-15, score-0.208]

4 This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. [sent-16, score-0.322]

5 Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. [sent-18, score-0.113]

6 1 Introduction In machine learning, it is often the case that unlabeled data is substantially cheaper and more plentiful than labeled data, and as a result a number of methods have been developed for using unlabeled data to try to improve performance, e.g. [sent-19, score-0.205]

7 Co-training [2] is a method that has had substantial success in scenarios in which examples can be thought of as containing two distinct yet sufficient feature sets. [sent-22, score-0.059]

8 Intuitively, this means that each example contains two “views,” and each view contains sufficient information to determine the label of the example. [sent-25, score-0.082]

9 This redundancy implies an underlying structure of the unlabeled data (since they need to be “consistent”), and this structure makes the unlabeled data informative. [sent-26, score-0.21]

10 The second is that these two views should on the other hand not be too highly correlated — we need to have at least some examples where h1 is confident but h2 is not (or vice versa) for the co-training algorithm to actually do anything. [sent-33, score-0.154]

11 Unfortunately, previous theoretical analyses have needed to make strong assumptions of this second type in order to prove their guarantees. [sent-34, score-0.081]

12 These include “conditional independence given the label” used by [2] and [4], or the assumption of “weak rule dependence” used by [1]. [sent-35, score-0.065]

13 However, we will need a fairly strong assumption on the learning algorithms: that the hypotheses h_i they produce are never “confident but wrong” (formally, the algorithms are able to learn from positive data only), though we give a heuristic analysis of the case when this does not hold. [sent-37, score-0.251]

14 One key feature of assuming only expansion on the data is that it specifically motivates the iterative nature of the co-training algorithm. [sent-38, score-0.21]

15 Previous assumptions that had been analyzed imply such a strong form of expansion that even a “one-shot” version of co-training will succeed (see Section 2. [sent-39, score-0.217]

16 That is, if we have sufficiently strong efficient PAC-learning algorithms for the target function on each feature set, we can use them to achieve efficient PAC-style guarantees for co-training as well. [sent-44, score-0.053]

17 We begin by formally defining the expansion assumption we will use, connecting it to standard graph-theoretic notions of expansion and conductance. [sent-47, score-0.291]

18 We then prove the statement that ε-expansion is sufficient for iterative co-training to succeed, given strong enough base learning algorithms over each view, proving bounds on the number of iterations needed to converge. [sent-48, score-0.108]

19 2, we present experiments on synthetic expander graph data that qualitatively bear out our analyses. [sent-52, score-0.136]

20 Let c denote the target function, and let X^+ and X^− denote the positive and negative regions of X respectively (for simplicity we assume we are doing binary classification). [sent-54, score-0.086]

21 For most of this paper we assume that each view in itself is sufficient for correct classification; that is, c can be decomposed into functions c1, c2 over each view such that D has no probability mass on examples x = ⟨x1, x2⟩ with c1(x1) ≠ c2(x2). [sent-55, score-0.159]

22 For i ∈ {1, 2}, let X_i^+ = {x_i ∈ X_i : c_i(x_i) = 1}, so we can think of X^+ as X_1^+ × X_2^+, and let X_i^− = X_i − X_i^+. [sent-56, score-0.056]

23 In order to discuss iterative co-training, we need to be able to talk about a hypothesis being confident or not confident on a given example. [sent-58, score-0.095]

24 This means we can think of a hypothesis h_i as a subset of X_i, where x_i ∈ h_i means that h_i is confident that x_i is positive, and x_i ∉ h_i means that h_i has no opinion. [sent-60, score-0.082]

25 As in [2], we will abstract away the initialization phase of co-training (how labeled data is used to generate an initial hypothesis) and assume we are given initial sets S_1^0 ⊆ X_1^+ and S_2^0 ⊆ X_2^+ such that Pr_{⟨x1,x2⟩∈D}(x1 ∈ S_1^0 or x2 ∈ S_2^0) ≥ ρ_init for some ρ_init > 0. [sent-61, score-0.059]

26 The goal of co-training will be to bootstrap from these sets using unlabeled data. [sent-62, score-0.109]

27 Now, to prove guarantees for iterative co-training, we make two assumptions: that the learning algorithms used in each of the two views are able to learn from positive data only, and that the distribution D+ is expanding as defined in Section 2. [sent-63, score-0.29]

28 1 Assumption about the base learning algorithms on the two views We assume that the learning algorithms on each view are able to PAC-learn from positive data only. [sent-66, score-0.197]

29 Specifically, for any distribution D_i^+ over X_i^+, and any given ε, δ > 0, given access to examples from D_i^+ the algorithm should be able to produce a hypothesis h_i such that (a) h_i ⊆ X_i^+ (so h_i only has one-sided error), and (b) with probability 1 − δ, the error of h_i under D_i^+ is at most ε. [sent-67, score-0.545]

30 Examples of concept classes learnable from positive data only include conjunctions, k-CNF, and axis-parallel rectangles; see [7]. [sent-69, score-0.084]

31 For instance, for the case of axis-parallel rectangles, a simple algorithm that achieves this guarantee is just to output the smallest rectangle enclosing the positive examples seen. [sent-70, score-0.099]
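For instance, the smallest-enclosing-rectangle learner is a few lines of Python (the class and variable names are ours, not the paper's): it is confident exactly on the box spanned by the positives it has seen, so its confident region is always contained in the true positive rectangle, which is the one-sided-error guarantee asked of the base learners.

```python
import random

class PositiveOnlyRectangle:
    """Learn an axis-parallel rectangle from positive data only:
    output the smallest rectangle enclosing the positive examples seen."""

    def __init__(self, dim):
        self.lo = [float("inf")] * dim
        self.hi = [float("-inf")] * dim

    def fit(self, positives):
        for x in positives:
            for j, v in enumerate(x):
                self.lo[j] = min(self.lo[j], v)
                self.hi[j] = max(self.hi[j], v)
        return self

    def confident(self, x):
        # Confident only inside the box spanned by seen positives,
        # so the hypothesis can never be "confident but wrong".
        return all(self.lo[j] <= x[j] <= self.hi[j] for j in range(len(x)))

random.seed(0)
true_box = [(0.2, 0.8), (0.3, 0.7)]  # hypothetical target rectangle
positives = [[random.uniform(a, b) for (a, b) in true_box] for _ in range(200)]
h = PositiveOnlyRectangle(2).fit(positives)

# One-sided error: every point h is confident about lies in the true box.
probes = [[random.random(), random.random()] for _ in range(1000)]
assert all(0.2 <= p[0] <= 0.8 and 0.3 <= p[1] <= 0.7
           for p in probes if h.confident(p))
```

With more samples the learned box approaches the target rectangle from the inside, matching the error-at-most-ε, one-sided guarantee required of the base learners.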

32 In addition, a nice feature of our assumption is that we will only need D+ to expand and not D− . [sent-73, score-0.075]

33-34 This is especially natural if the positive class has a large amount of cohesion (e.g., it consists of all documents about some topic Y) but the negatives do not. [sent-74, score-0.058] [sent-75, score-0.052]

35 2 The expansion assumption for the underlying distribution For S1 ⊆ X1 and S2 ⊆ X2, let boldface S_i (i = 1, 2) denote the event that an example ⟨x1, x2⟩ has x_i ∈ S_i. [sent-80, score-0.168]

36 So, if we think of S1 and S2 as our confident sets in each view, then Pr(S1 ∧ S2 ) denotes the probability mass on examples for which we are confident about both views, and Pr(S1 ⊕ S2 ) denotes the probability mass on examples for which we are confident about just one. [sent-81, score-0.188]

37 We say that D+ is ε-expanding with respect to hypothesis class H1 × H2 if the above holds for all S1 ∈ H1 ∩ X_1^+, S2 ∈ H2 ∩ X_2^+ (here we denote by H_i ∩ X_i^+ the set {h ∩ X_i^+ : h ∈ H_i} for i = 1, 2). [sent-84, score-0.064]

38 To get a feel for this definition, notice that ε-expansion is in some sense necessary for iterative co-training to succeed, because if S1 and S2 are our confident sets and do not expand, then we might never see examples for which one hypothesis could help the other. [sent-85, score-0.199]
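Definition 1 itself is elided from this extract; a reading consistent with the surrounding discussion is that D+ is ε-expanding when Pr(S1 ⊕ S2) ≥ ε · min[Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2)] for every pair of confident sets. The sketch below (the toy distribution, the choice of sets, and all names are ours) estimates that ratio by Monte Carlo for one pair of sets; verifying full ε-expansion would require quantifying over all pairs.

```python
import random

random.seed(1)

def sample_positive():
    # Toy two-view positive example: x2 copies x1 except that one random
    # coordinate is re-randomized (in the spirit of Example 1 below).
    x1 = [random.random() for _ in range(3)]
    x2 = list(x1)
    x2[random.randrange(3)] = random.random()
    return x1, x2

def expansion_ratio(in_s1, in_s2, n=100_000):
    """Estimate Pr(S1 xor S2) / min(Pr(S1 and S2), Pr(not S1 and not S2))."""
    both = xor = neither = 0
    for _ in range(n):
        x1, x2 = sample_positive()
        a, b = in_s1(x1), in_s2(x2)
        both += a and b
        xor += a != b
        neither += (not a) and (not b)
    denom = min(both, neither)
    return xor / denom if denom else float("inf")

# Confident sets: "first coordinate below 1/2" in each view.
ratio = expansion_ratio(lambda x: x[0] < 0.5, lambda x: x[0] < 0.5)
print(f"empirical expansion ratio: {ratio:.2f}")
```

For this pair the exact ratio is (1/6)/(5/12) = 0.4: the views disagree only when coordinate 0 is re-randomized to the other side, yet that disagreement mass is a constant fraction of the agreement mass, which is what lets iterative co-training make progress.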

39 To see how much weaker this definition is than previously-considered requirements, it is helpful to consider a slightly stronger kind of expansion that we call “left-right expansion”. [sent-87, score-0.166]

40 However, ε-expansion requires every pair to expand and so it is not strictly necessary. [sent-89, score-0.055]

41 It is not immediately obvious but left-right expansion in fact implies Definition 1 (see Appendix A), though the converse is not necessarily true. [sent-93, score-0.153]

42 Secondly, this notion helps clarify how our assumptions are much less restrictive than those considered previously. [sent-96, score-0.049]

43 Specifically, Independence given the label: Independence given the label implies that for any S1 ⊆ X_1^+ and S2 ⊆ X_2^+ we have Pr(S2 | S1) = Pr(S2). [sent-97, score-0.077]

44 This means that not only does S1 expand by a (1 + ε) factor as in Def. [sent-99, score-0.055]

45 Weak dependence: Weak dependence [1] is a relaxation of conditional independence that requires only that for all S1 ⊆ X_1^+, S2 ⊆ X_2^+ we have Pr(S2 | S1) ≥ α Pr(S2) for some α > 0. [sent-101, score-0.089]

46 However, notice that if Pr(S2 | S1) ≥ 1 − ε, then Pr(¬S2 | S1) ≤ ε, which implies by the definition of weak dependence that Pr(¬S2) ≤ ε/α and therefore Pr(S2) ≥ 1 − ε/α. [sent-103, score-0.127]

47 3.1 Connections to standard graph-theoretic notions of expansion: Our definition of ε-expansion (Definition 1) is a natural analog of the standard graph-theoretic notion of edge-expansion or conductance. [sent-108, score-0.16]

48 A Markov chain is said to have high conductance if under the stationary distribution, for any set of states S of probability at most 1/2, the probability mass on transitions exiting S is at least ε times the probability of S. [sent-109, score-0.068]

49 It is well-known that, for example, a random degree-3 bipartite graph with high probability is expanding, and this in fact motivates our synthetic data experiments of Section 4. [sent-116, score-0.116]

50 2 Examples We now give two simple examples that satisfy ε-expansion but not weak dependence. [sent-120, score-0.084]

51 Suppose a random positive example from D+ looks like a pair x1 , x2 such that x1 and x2 are each uniformly distributed in their rectangles but in a highly-dependent way: specifically, x2 is identical to x1 except that a random coordinate has been “re-randomized” within the rectangle. [sent-122, score-0.094]

52 This distribution does not satisfy weak dependence (for any sets S and T that are disjoint along all axes we have Pr(T | S) = 0) but it is not hard to verify that D+ is ε-expanding for ε = Ω(1/d). [sent-123, score-0.092]

53 Example 2: Imagine that we have a learning problem such that the data in X1 falls into n different clusters: the positive class is the union of some of these clusters and the negative class is the union of the others. [sent-124, score-0.119]

54 Independence given the label would say that given that x1 is in some positive cluster Ci in X1 , x2 is equally likely to be in any of the positive clusters Cj in X2 . [sent-126, score-0.21]

55 This distribution clearly will not even have the weak dependence property. [sent-130, score-0.068]

56 However, say we have a learning algorithm that assumes everything in the same cluster has the same label (so the hypothesis space H consists of all rules that do not split clusters). [sent-131, score-0.101]

57 If the graph of which clusters are associated with which is an expander graph, then the distributions will be expanding with respect to H. [sent-132, score-0.153]

58 In particular, given a labeled example x, the learning algorithm will generalize to x’s entire cluster Ci , then this will be propagated over to nodes in the associated clusters Cj in X2 , and so on. [sent-133, score-0.068]
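The cluster picture above can be simulated directly. A small sketch (the graph construction and all names are ours): positive clusters on the two sides form a bipartite graph, here a cycle plus one random chord per node so that the graph is connected and expander-like, and confidence spreads along edges exactly as described.

```python
import random

random.seed(2)
n = 64  # positive clusters on each side (views X1 and X2)

# Bipartite cluster graph: left cluster i is linked to right clusters
# i and i+1 (a cycle, guaranteeing connectivity) plus one random chord.
edges = [{i, (i + 1) % n, random.randrange(n)} for i in range(n)]

confident_left = {0}   # initial confident cluster in view X1
confident_right = set()
rounds = 0
while len(confident_left) < n or len(confident_right) < n:
    # A confident cluster in one view makes its associated clusters
    # confident in the other view (the co-training bootstrap step).
    confident_right |= {j for i in confident_left for j in edges[i]}
    confident_left |= {i for i in range(n) if edges[i] & confident_right}
    rounds += 1

print(f"covered all {n} positive clusters in {rounds} rounds")
```

The cycle edges alone already guarantee full coverage in at most n rounds; the random chords are what make the graph expander-like and speed the spread up.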

59 The iterative co-training that we consider proceeds in rounds. [sent-136, score-0.055]

60 Let S_1^i ⊆ X_1 and S_2^i ⊆ X_2 be the confident sets in each view at the start of round i. [sent-137, score-0.069]

61 That is, we take unlabeled examples from D such that at least one of the current predictors is confident, and feed them into A2 as if they were positive. [sent-139, score-0.145]

62 After a pre-determined number of rounds N (specified in Theorem 1), the algorithm terminates and outputs the predictor that labels an example ⟨x1, x2⟩ as positive if x1 ∈ S_1^{N+1} or x2 ∈ S_2^{N+1}, and negative otherwise. [sent-142, score-0.18]
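A runnable sketch of these rounds on a toy problem of our own (one-dimensional views, interval hypotheses, and a noise-free pool containing only positive examples; none of this is the paper's experimental setup): the base learner returns the smallest interval enclosing the positives it is given, and each round the examples that either current predictor is confident about are fed to both learners as positives.

```python
import random

random.seed(3)
POS = (0.3, 0.7)  # true positive region in each view (our toy choice)

def sample_positive():
    # Two views of a positive example; view 2 is re-randomized half the
    # time, so the views are correlated but not identical.
    x1 = random.uniform(*POS)
    x2 = x1 if random.random() < 0.5 else random.uniform(*POS)
    return x1, x2

def smallest_interval(points):
    # Positive-data-only base learner: tightest enclosing interval.
    return (min(points), max(points)) if points else None

def confident(interval, x):
    return interval is not None and interval[0] <= x <= interval[1]

unlabeled = [sample_positive() for _ in range(2000)]
s1, s2 = (0.45, 0.55), None  # small initial confident set in view 1

for _ in range(5):
    # Examples either predictor is confident about are treated as positive.
    pool = [(x1, x2) for x1, x2 in unlabeled
            if confident(s1, x1) or confident(s2, x2)]
    s2 = smallest_interval([x2 for _, x2 in pool])
    s1 = smallest_interval([x1 for x1, _ in pool])

coverage = sum(confident(s1, x1) or confident(s2, x2)
               for x1, x2 in unlabeled) / len(unlabeled)
print(f"final coverage of positives: {coverage:.3f}")
```

The first round already illustrates the mechanism: view 1 is confident only on a narrow slice, but the re-randomized view-2 coordinates of that slice spread across the whole positive region, so the view-2 learner's confident set expands, and the next round bootstraps view 1 in turn.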

63 Since Pr(S1 ∧ S2) ≤ Pr(¬S1 ∧ ¬S2), it follows from the expansion property that Pr(S1 ∨ S2) = Pr(S1 ⊕ S2) + Pr(S1 ∧ S2) ≥ (1 + ε) Pr(S1 ∧ S2). [sent-149, score-0.133]

64 Since Pr(S1 ∧ S2) > Pr(¬S1 ∧ ¬S2), it follows from the expansion property that Pr(S1 ⊕ S2) ≥ ε Pr(¬S1 ∧ ¬S2). [sent-154, score-0.133]

65 Theorem 1: Let ε_fin and δ_fin be the (final) desired accuracy and confidence parameters. [sent-158, score-0.052]

66 Then we can achieve error rate ε_fin with probability 1 − δ_fin by running co-training for N = O((1/ε) log(1/ε_fin) + 1/(ε·ρ_init)) rounds, each time running A1 and A2 with accuracy and confidence parameters set to ε·ε_fin/8 and δ_fin/(2N) respectively. [sent-159, score-0.052]

67 Proof Sketch: Assume that, for i ≥ 1, S_1^i ⊆ X_1^+ and S_2^i ⊆ X_2^+ are the confident sets in each view after step i − 1 of co-training. [sent-160, score-0.069]

68 Define p_i = Pr(S_1^i ∧ S_2^i), q_i = Pr(¬S_1^i ∧ ¬S_2^i), and γ_i = 1 − p_i, with all probabilities with respect to D+. [sent-161, score-0.093]

69 We are interested in bounding Pr(S_1^N ∨ S_2^N), but since technically it is easier to bound Pr(S_1^N ∧ S_2^N), we will instead show that p_N ≥ 1 − ε_fin with probability 1 − δ_fin, which obviously implies that Pr(S_1^N ∨ S_2^N) is at least as good. [sent-162, score-0.059]

70 If p_i ≤ q_i: since with probability 1 − δ_fin/N we have Pr(S_1^{i+1} | S_1^i ∨ S_2^i) ≥ 1 − ε·ε_fin/8 and Pr(S_2^{i+1} | S_1^i ∨ S_2^i) ≥ 1 − ε·ε_fin/8, using Lemma 1 we obtain that with probability 1 − δ_fin/N we have Pr(S_1^{i+1} ∧ S_2^{i+1}) ≥ (1 + ε/2) Pr(S_1^i ∧ S_2^i). [sent-166, score-0.076]

71 Similarly, by applying Lemma 2, we obtain that if p_i > q_i and γ_i ≥ ε_fin then with probability 1 − δ_fin/N we have Pr(S_1^{i+1} ∧ S_2^{i+1}) ≥ (1 + ε·γ_i/8) Pr(S_1^i ∧ S_2^i). [sent-167, score-0.076]

72 At this point, notice that every 8/ε rounds, γ_i drops by at least a factor of 2; that is, if γ_i ≤ 1/2^k then γ_{i+8/ε} ≤ 1/2^{k+1}. [sent-171, score-0.061]

73 So, after a total of O((1/ε) log(1/ε_fin) + 1/(ε·ρ_init)) rounds, we have a predictor of the desired accuracy with the desired confidence. [sent-172, score-0.052]

74 In addition, we have assumed that given correctly-labeled positive examples as input, our learning algorithms are able to generalize in a way that makes only one-sided error. [sent-174, score-0.099]

75 In this section we give a heuristic analysis of the case when these assumptions are relaxed, along with several synthetic experiments on expander graphs. [sent-177, score-0.147]

76 1 Heuristic Analysis of Error Propagation: Given confident sets S_1^i ⊆ X_1 and S_2^i ⊆ X_2 at the ith iteration, let us define their purity (precision) as pur_i = Pr_D(c(x) = 1 | S_1^i ∨ S_2^i) and their coverage (recall) to be cov_i = Pr_D(S_1^i ∨ S_2^i | c(x) = 1). [sent-179, score-0.315]

77 Let us also define their “opposite coverage” to be opp_i = Pr_D(S_1^i ∨ S_2^i | c(x) = 0). [sent-180, score-0.196]

78 Previously, we assumed opp_i = 0 and therefore pur_i = 1. [sent-181, score-0.222]

[Figure 1: co-training with noise. Three panels plot overall accuracy, accuracy on positives, and accuracy on negatives against iteration number, for three different noise rates.]

82 Solid line indicates overall accuracy; green (dashed, increasing) curve is accuracy on positives (cov_i); red (dashed, decreasing) curve is accuracy on negatives (1 − opp_i). [sent-199, score-0.351]

83 That is, this corresponds to both the positive and negative parts of the confident region expanding in the way given in the proof of Theorem 1, with an η fraction of the new edges going to examples of the other label. [sent-200, score-0.185]

84 First, initially when coverage is low, every O(1/ε) steps we get roughly cov ← 2·cov and opp ← 2·opp + η·cov. [sent-202, score-0.253]

85 So, we expect coverage to increase exponentially and purity to drop linearly. [sent-203, score-0.115]

86 However, once coverage gets large and begins to saturate, if purity is still high at this time it will begin dropping rapidly as the exponential increase in opp_i causes opp_i to catch up with cov_i. [sent-204, score-0.676]

87 In particular, a calculation (omitted) shows that if D is 50/50 positive and negative, then overall accuracy increases up to the point when cov_i + opp_i = 1, and then drops from then on. [sent-205, score-0.498]
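The rise-then-fall prediction can be checked by iterating the heuristic update directly (the starting values and the noise rate η below are our choices): with a 50/50 class balance, overall accuracy is (cov + (1 − opp))/2, which climbs while coverage doubles and collapses once opp catches up, peaking roughly where cov + opp crosses 1.

```python
# Iterate the heuristic update: cov <- 2*cov, opp <- 2*opp + eta*cov,
# both clamped to [0, 1], and track overall accuracy for a 50/50 mix.
eta, cov, opp = 0.01, 0.01, 0.0
accuracies = []
for _ in range(20):
    accuracies.append((cov + (1.0 - opp)) / 2.0)
    cov, opp = min(1.0, 2.0 * cov), min(1.0, 2.0 * opp + eta * cov)

peak = max(range(len(accuracies)), key=accuracies.__getitem__)
print(f"accuracy peaks at step {peak}: {accuracies[peak]:.3f}")
```

With these starting values coverage saturates at step 7, opposite coverage is still small there (so cov + opp has just crossed 1), and from then on the doubling of opp drags overall accuracy back down toward 1/2, the qualitative shape reported for the synthetic experiments.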

88 Nodes 1 to n on each side represent positive clusters, and nodes n + 1 to 2n on each side represent negative clusters. [sent-211, score-0.086]

89 We begin with an initial confident set S1 ⊆ X1 and then propagate confidence through rounds of co-training, monitoring the percentage of the positive class covered, the percent of the negative class mistakenly covered, and the overall accuracy. [sent-213, score-0.177]

90 As can be seen, these qualitatively match what we expect: coverage increases exponentially, but accuracy on negatives (1 − opp_i) drops exponentially too, though somewhat delayed. [sent-218, score-0.403]

91 At some point there is a crossover where cov_i = 1 − opp_i, which as predicted roughly corresponds to the point at which overall accuracy starts to drop. [sent-219, score-0.417]

92 5 Conclusions Co-training is a method for using unlabeled data when examples can be partitioned into two views such that (a) each view in itself is at least roughly sufficient to achieve good classification, and yet (b) the views are not too highly correlated. [sent-220, score-0.378]

93 Previous theoretical work has required instantiating condition (b) in a very strong sense: as independence given the label, or a form of weak dependence. [sent-221, score-0.118]

94 In this work, we argue that the “right” condition is something much weaker: an expansion property on the underlying distribution (over positive examples) that we show is sufficient and to some extent necessary as well. [sent-222, score-0.191]

95 The expansion property is especially interesting because it directly motivates the iterative nature of many of the practical co-training based algorithms, and our work is the first rigorous analysis of iterative co-training in a setting that demonstrates its advantages over one-shot versions. [sent-223, score-0.285]

96 Combining labeled and unlabeled data for text classification with a large number of categories. [sent-252, score-0.143]

97 Text classification from labeled and unlabeled documents using em. [sent-289, score-0.14]

98 Large scale unstructured document classification using unlabeled data and syntactic information. [sent-294, score-0.085]

99 Theorem 2: If D+ satisfies ε-left-right expansion (Definition 2), then it also satisfies ε′-expansion (Definition 1) for ε′ = ε/(1 + ε). [sent-321, score-0.113]

100 Similarly if Pr(S2 ) ≤ Pr(S1 ) we get a failure of expansion in the other direction. [sent-330, score-0.133]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pr', 0.806), ('dent', 0.248), ('si', 0.201), ('oppi', 0.196), ('covi', 0.15), ('hi', 0.116), ('expansion', 0.113), ('init', 0.105), ('views', 0.094), ('unlabeled', 0.085), ('con', 0.079), ('coverage', 0.079), ('expander', 0.06), ('positive', 0.058), ('iterative', 0.055), ('expand', 0.055), ('rounds', 0.053), ('accuracy', 0.052), ('succeed', 0.046), ('nition', 0.046), ('view', 0.045), ('independence', 0.045), ('opp', 0.045), ('prd', 0.045), ('weak', 0.043), ('motivates', 0.042), ('examples', 0.041), ('hypothesis', 0.04), ('implies', 0.04), ('label', 0.037), ('pi', 0.037), ('expanding', 0.037), ('purity', 0.036), ('rectangles', 0.036), ('labeled', 0.035), ('xi', 0.035), ('weaker', 0.034), ('clusters', 0.033), ('synthetic', 0.032), ('classi', 0.032), ('cov', 0.032), ('negatives', 0.032), ('page', 0.031), ('strong', 0.03), ('ci', 0.03), ('cj', 0.029), ('negative', 0.028), ('suppose', 0.028), ('assumptions', 0.028), ('mass', 0.028), ('dence', 0.027), ('heuristic', 0.027), ('think', 0.026), ('puri', 0.026), ('avrim', 0.026), ('learnable', 0.026), ('notions', 0.026), ('dependence', 0.025), ('say', 0.024), ('heuristically', 0.024), ('rivest', 0.024), ('lemmas', 0.024), ('sets', 0.024), ('imagine', 0.023), ('suf', 0.023), ('prove', 0.023), ('mellon', 0.023), ('drops', 0.023), ('carnegie', 0.023), ('text', 0.023), ('guarantees', 0.023), ('graph', 0.023), ('expands', 0.022), ('nlp', 0.022), ('pittsburgh', 0.022), ('notion', 0.021), ('proof', 0.021), ('qualitatively', 0.021), ('pages', 0.021), ('conductance', 0.021), ('property', 0.02), ('assumption', 0.02), ('documents', 0.02), ('lemma', 0.02), ('nigam', 0.02), ('get', 0.02), ('begin', 0.019), ('least', 0.019), ('stronger', 0.019), ('qi', 0.019), ('notice', 0.019), ('bipartite', 0.019), ('blum', 0.019), ('entity', 0.019), ('linguistics', 0.019), ('meeting', 0.019), ('cally', 0.019), ('overall', 0.019), ('conditional', 0.019), ('thought', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice


2 0.23390251 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

3 0.13206841 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

4 0.12024547 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

Author: Mark Craven, Joseph Bockhorst

Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1

5 0.10503884 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

Author: Daniela D. Farias, Nimrod Megiddo

Abstract: A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven. 1

6 0.094623215 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

7 0.090571873 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

8 0.0860238 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

9 0.084393397 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

10 0.084254943 164 nips-2004-Semi-supervised Learning by Entropy Minimization

11 0.076141931 69 nips-2004-Fast Rates to Bayes for Kernel Machines

12 0.065775447 23 nips-2004-Analysis of a greedy active learning strategy

13 0.065324873 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

14 0.064730033 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

15 0.057784684 54 nips-2004-Distributed Information Regularization on Graphs

16 0.056108233 161 nips-2004-Self-Tuning Spectral Clustering

17 0.048943877 130 nips-2004-Newscast EM

18 0.046500102 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

19 0.042631213 122 nips-2004-Modelling Uncertainty in the Game of Go

20 0.040877305 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.152), (1, 0.051), (2, 0.07), (3, 0.029), (4, -0.005), (5, 0.151), (6, 0.009), (7, 0.145), (8, -0.04), (9, -0.05), (10, 0.044), (11, 0.024), (12, -0.026), (13, 0.11), (14, -0.039), (15, -0.003), (16, 0.03), (17, 0.092), (18, 0.03), (19, 0.017), (20, -0.039), (21, 0.195), (22, -0.051), (23, -0.063), (24, -0.202), (25, 0.175), (26, -0.113), (27, 0.061), (28, -0.08), (29, 0.336), (30, 0.082), (31, -0.113), (32, 0.068), (33, -0.007), (34, -0.138), (35, 0.156), (36, -0.048), (37, 0.071), (38, 0.014), (39, -0.089), (40, 0.107), (41, -0.046), (42, -0.009), (43, -0.106), (44, 0.061), (45, 0.05), (46, -0.016), (47, -0.034), (48, 0.031), (49, 0.103)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97512037 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

2 0.65561604 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

3 0.5979141 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

Author: Mark Craven, Joseph Bockhorst

Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1

4 0.51444459 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time, often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs.
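The segment-level decoding that makes semi-CRFs attractive is a small dynamic program. The sketch below is an illustration, not the paper's implementation; `best_segmentation` and the toy `run_score` feature function are assumptions. Each candidate segment x[i:j] with label y, following a segment labeled y_prev, receives an arbitrary score, so features can depend on whole segments rather than single positions.

```python
def best_segmentation(x, labels, score, max_len):
    # V[j] maps a label y to the best (score, segments) pair for a
    # segmentation of x[:j] whose last segment is labeled y.
    V = [dict() for _ in range(len(x) + 1)]
    V[0][None] = (0.0, [])
    for j in range(1, len(x) + 1):
        for d in range(1, min(max_len, j) + 1):   # candidate segment lengths
            i = j - d
            for y_prev, (s, segs) in V[i].items():
                for y in labels:
                    cand = s + score(x, i, j, y, y_prev)
                    if y not in V[j] or cand > V[j][y][0]:
                        V[j][y] = (cand, segs + [(i, j, y)])
    return max(V[len(x)].values(), key=lambda t: t[0])[1]

def run_score(x, i, j, y, y_prev):
    # Toy segment feature: strongly reward labeling a run of identical
    # characters with that character's uppercase label.
    seg = x[i:j]
    if len(set(seg)) == 1 and y == seg[0].upper():
        return float((j - i) ** 2)
    return -1.0
```

On "aabba" this recovers the run structure [(0,2,'A'), (2,4,'B'), (4,5,'A')], something a per-position model cannot express with a single segment-length feature.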

5 0.49643221 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

Author: Daniela D. Farias, Nimrod Megiddo

Abstract: A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.
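Of the variants listed, explore-then-exploit is the simplest to write down. The sketch below is a generic illustration whose names and interfaces are assumptions, not the paper's: each expert is tried for a fixed exploration phase, its average reward is taken as its value estimate, and the apparently best expert is followed for the rest of the horizon. In a truly reactive environment this estimate can mislead, because the environment's response depends on the history of play, which is exactly the subtlety the paper addresses.

```python
def explore_then_exploit(experts, environment, explore_rounds, total_rounds):
    # experts: list of zero-argument action functions.
    # environment: maps an action to a reward (and may keep internal state).
    t, estimates = 0, []
    for expert in experts:                      # exploration phase
        total = 0.0
        for _ in range(explore_rounds):
            total += environment(expert())
            t += 1
        estimates.append(total / explore_rounds)
    best = max(range(len(experts)), key=lambda i: estimates[i])
    reward = 0.0
    while t < total_rounds:                     # exploitation phase
        reward += environment(experts[best]())
        t += 1
    return best, reward
```

With a stateless environment the method trivially finds the best expert; the paper's analysis concerns what happens when the environment reacts, so that short exploration phases can bias the value estimates.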

6 0.38346583 130 nips-2004-Newscast EM

7 0.36804649 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

8 0.36469078 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

9 0.30432144 122 nips-2004-Modelling Uncertainty in the Game of Go

10 0.29647273 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

11 0.27565548 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

12 0.26917064 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

13 0.25721848 164 nips-2004-Semi-supervised Learning by Entropy Minimization

14 0.25716639 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

15 0.25337884 54 nips-2004-Distributed Information Regularization on Graphs

16 0.25050259 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

17 0.23856497 82 nips-2004-Incremental Algorithms for Hierarchical Classification

18 0.23794304 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

19 0.23405647 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

20 0.23182349 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.068), (15, 0.087), (17, 0.019), (26, 0.472), (31, 0.031), (33, 0.141), (35, 0.018), (39, 0.013), (50, 0.027), (87, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97013509 185 nips-2004-The Convergence of Contrastive Divergences

Author: Alan L. Yuille

Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
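As a concrete instance of the algorithm being analyzed, here is a minimal CD-1 update for a tiny binary restricted Boltzmann machine, a standard setting for contrastive divergence. The class and its dimensions are illustrative assumptions, not taken from the paper. The parameter gradient is approximated by the difference between statistics on the data and on a one-step Gibbs reconstruction; the convergence of updates built from this approximation is what the paper studies.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(p):
    return 1 if random.random() < p else 0

class TinyRBM:
    """Binary RBM small enough to train with plain Python lists."""

    def __init__(self, n_visible, n_hidden):
        self.W = [[random.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.vb = [0.0] * n_visible   # visible biases
        self.hb = [0.0] * n_hidden    # hidden biases

    def p_h(self, v):
        return [sigmoid(self.hb[j] + sum(v[i] * self.W[i][j]
                                         for i in range(len(v))))
                for j in range(len(self.hb))]

    def p_v(self, h):
        return [sigmoid(self.vb[i] + sum(h[j] * self.W[i][j]
                                         for j in range(len(h))))
                for i in range(len(self.vb))]

    def cd1_step(self, v0, lr=0.1):
        # Positive phase on the data, negative phase on a one-step
        # Gibbs reconstruction; their difference approximates the gradient.
        ph0 = self.p_h(v0)
        h0 = [sample(p) for p in ph0]
        v1 = [sample(p) for p in self.p_v(h0)]
        ph1 = self.p_h(v1)
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.vb[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.hb[j] += lr * (ph0[j] - ph1[j])
```

Trained repeatedly on a single pattern, the model's reconstruction probabilities for that pattern approach 1, which is the behavior the convergence conditions in the paper are meant to guarantee or rule out.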

2 0.95602077 34 nips-2004-Breaking SVM Complexity with Cross-Training

Author: Léon Bottou, Jason Weston, Gökhan H. Bakir

Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages.
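The editing idea can be illustrated with a cross-validation-style filter: train on all but one fold and drop held-out examples the model misclassifies. This is only a rough stand-in for the paper's probabilistic editing estimates; the centroid base learner and all names here are illustrative assumptions.

```python
def edit_training_set(data, train, predict, folds=3):
    # data: list of (x, y). Keep only examples that a model trained on
    # the *other* folds classifies correctly.
    kept = []
    for f in range(folds):
        held = data[f::folds]
        rest = [p for i, p in enumerate(data) if i % folds != f]
        model = train(rest)
        kept.extend(p for p in held if predict(model, p[0]) == p[1])
    return kept

def train_centroids(pairs):
    # One centroid per class on a 1-D feature (toy base learner).
    c0 = [x for x, y in pairs if y == 0]
    c1 = [x for x, y in pairs if y == 1]
    return sum(c0) / len(c0), sum(c1) / len(c1)

def predict_centroids(model, x):
    c0, c1 = model
    return 1 if abs(x - c1) < abs(x - c0) else 0
```

A mislabeled point stranded deep inside the other class gets filtered out, which is how editing pushes the remaining training set toward separability and keeps the support-vector count from growing with the number of examples.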

3 0.94624311 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters

Author: Evan C. Smith, Michael S. Lewicki

Abstract: The representation of acoustic signals at the cochlear nerve must serve a wide range of auditory tasks that require exquisite sensitivity in both time and frequency. Lewicki (2002) demonstrated that many of the filtering properties of the cochlea could be explained in terms of efficient coding of natural sounds. This model, however, did not account for properties such as phase-locking or how sound could be encoded in terms of action potentials. Here, we extend this theoretical approach with an algorithm for learning efficient auditory codes using a spiking population code. In this model, each spike encodes the precise time position and magnitude of a localized, time-varying kernel function. By adapting the kernel functions to the statistics of natural sounds, we show that, compared to conventional signal representations, the spike code achieves far greater coding efficiency. Furthermore, the inferred kernels show both striking similarities to measured cochlear filters and a similar bandwidth versus frequency dependence.
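A spike code of this kind can be computed greedily. The sketch below uses plain matching pursuit, a simplification of the paper's encoding; the kernel shapes and names are assumptions. Each "spike" records the time, kernel identity, and amplitude that best explain the current residual, assuming unit-norm kernels so the inner product is the optimal amplitude.

```python
def matching_pursuit(signal, kernels, n_spikes):
    # Greedily explain the signal as a sum of shifted, scaled kernels.
    # Each spike is (time, kernel index, amplitude).
    residual = list(signal)
    spikes = []
    for _ in range(n_spikes):
        best = None
        for k, kernel in enumerate(kernels):
            for t in range(len(signal) - len(kernel) + 1):
                # Projection of the residual onto the shifted kernel.
                dot = sum(residual[t + u] * kernel[u]
                          for u in range(len(kernel)))
                if best is None or abs(dot) > abs(best[2]):
                    best = (t, k, dot)
        t, k, amp = best
        spikes.append(best)
        for u, val in enumerate(kernels[k]):
            residual[t + u] -= amp * val
    return spikes, residual
```

When the signal is exactly one shifted, scaled kernel, a single spike reconstructs it and the residual vanishes, which is the sense in which a handful of spikes can be a far more efficient code than a dense sample grid.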

same-paper 4 0.92602158 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, which we prove is sufficient for iterative co-training to succeed given appropriately strong PAC-learning algorithms on each feature set, and which is to some extent necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect of noise in the data on performance. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs.

5 0.75658625 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

Author: Alan L. Yuille

Abstract: This paper analyzes generalizations of the classic Rescorla-Wagner (RW) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ∆P and PC, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
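For reference, the classic RW rule that the GLRW algorithms generalize: every cue present on a trial shares a single prediction error. The code below is the textbook update, not the paper's generalization, and the trial format is an assumption.

```python
def rescorla_wagner(trials, n_cues, lr=0.1):
    # trials: list of (present_cue_indices, outcome) pairs.
    V = [0.0] * n_cues                        # associative strength per cue
    for cues, outcome in trials:
        prediction = sum(V[i] for i in cues)  # summed strength of present cues
        delta = lr * (outcome - prediction)   # shared prediction error
        for i in cues:
            V[i] += delta
    return V
```

Because the error term is shared across cues, a cue that already predicts the outcome "blocks" learning about a redundant second cue, the kind of fixed-point behavior the paper characterizes for the generalized algorithms.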

6 0.64179653 69 nips-2004-Fast Rates to Bayes for Kernel Machines

7 0.63577831 48 nips-2004-Convergence and No-Regret in Multiagent Learning

8 0.62994605 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

9 0.62427127 68 nips-2004-Face Detection --- Efficient and Rank Deficient

10 0.62213248 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

11 0.61365741 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

12 0.60838807 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

13 0.60319287 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

14 0.59746677 103 nips-2004-Limits of Spectral Clustering

15 0.59486651 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

16 0.58861268 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

17 0.58774412 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern

18 0.57654124 28 nips-2004-Bayesian inference in spiking neurons

19 0.57436687 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity

20 0.57102472 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill